WO 2005/042708 PCT/US2004/035636 






METHOD OF DESIGNING siRNAS FOR GENE SILENCING 



This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional 
Patent Application No. 60/572,314, filed on May 17, 2004, and U.S. Provisional Patent 
Application No. 60/515,180, filed on October 27, 2003, each of which is incorporated by 
reference herein in its entirety. 

1. FIELD OF THE INVENTION 

The present invention relates to methods for identifying siRNA target motifs in a 
transcript. The invention also relates to methods for identifying off-target genes of an 
siRNA. The invention further relates to methods for designing siRNAs with higher silencing 
efficacy and specificity. The invention also relates to a library of siRNAs comprising 
siRNAs with high silencing efficacy and specificity. 

2. BACKGROUND OF THE INVENTION 

RNA interference (RNAi) is a potent method to suppress gene expression in 
mammalian cells, and has generated much excitement in the scientific community (Couzin, 
2002, Science 298:2296-2297; McManus et al., 2002, Nat. Rev. Genet. 3, 737-747; Hannon, 
G. J., 2002, Nature 418, 244-251; Paddison et al., 2002, Cancer Cell 2, 17-23). RNA 
interference is conserved throughout evolution, from C. elegans to humans, and is believed to 
function in protecting cells from invasion by RNA viruses. When a cell is infected by a 
dsRNA virus, the dsRNA is recognized and targeted for cleavage by an RNase III-type 
enzyme termed Dicer. The Dicer enzyme "dices" the RNA into short duplexes of 21 nt, 
termed siRNAs or short-interfering RNAs, composed of 19 nt of perfectly paired 
ribonucleotides with two unpaired nucleotides on the 3' end of each strand. These short 
duplexes associate with a multiprotein complex termed RISC, and direct this complex to 
mRNA transcripts with sequence similarity to the siRNA. As a result, nucleases present in the 
RISC complex cleave the mRNA transcript, thereby abolishing expression of the gene 
product. In the case of viral infection, this mechanism would result in destruction of viral 
transcripts, thus preventing viral synthesis. Since the siRNAs are double-stranded, either 
strand has the potential to associate with RISC and direct silencing of transcripts with 
sequence similarity. 

Specific gene silencing promises the potential to harness human genome data to 
elucidate gene function, identify drug targets, and develop more specific therapeutics. Many 
of these applications assume a high degree of specificity of siRNAs for their intended targets. 
Cross-hybridization with transcripts containing partial identity to the siRNA sequence may 
elicit phenotypes reflecting silencing of unintended transcripts in addition to the target gene. 
This could confound the identification of the gene implicated in the phenotype. Numerous 
reports in the literature purport the exquisite specificity of siRNAs, suggesting a requirement 
for near-perfect identity with the siRNA sequence (Elbashir et al., 2001, EMBO J. 20:6877- 
6888; Tuschl et al., 1999, Genes Dev. 13:3191-3197; Hutvagner et al., Sciencexpress 
297:2056-2060). One recent report suggests that perfect sequence complementarity is 
required for siRNA-targeted transcript cleavage, while partial complementarity will lead to 
translational repression without transcript degradation, in the manner of microRNAs 
(Hutvagner et al., Sciencexpress 297:2056-2060). 

The biological function of small regulatory RNAs, including siRNAs and miRNAs, is 
not well understood. One prevailing question regards the mechanism by which the distinct 
silencing pathways of these two classes of regulatory RNA are determined. miRNAs are 
regulatory RNAs expressed from the genome, and are processed from precursor stem-loop 
structures to produce single-stranded nucleic acids that bind to sequences in the 3' UTR of the 
target mRNA (Lee et al., 1993, Cell 75:843-854; Reinhart et al., 2000, Nature 403:901-906; 
Lee et al., 2001, Science 294:862-864; Lau et al., 2001, Science 294:858-862; Hutvagner et 
al., 2001, Science 293:834-838). miRNAs bind to transcript sequences with only partial 
complementarity (Zeng et al., 2002, Molec Cell 9:1327-1333) and repress translation without 
affecting steady-state RNA levels (Lee et al., 1993, Cell 75:843-854; Wightman et al., 1993, 
Cell 75:855-862). Both miRNAs and siRNAs are processed by Dicer and associate with 
components of the RNA-induced silencing complex (Hutvagner et al., 2001, Science 
293:834-838; Grishok et al., 2001, Cell 106:23-34; Ketting et al., 2001, Genes Dev. 15:2654- 
2659; Williams et al., 2002, Proc. Natl. Acad. Sci. USA 99:6889-6894; Hammond et al., 
2001, Science 293:1146-1150; Mourelatos et al., 2002, Genes Dev. 16:720-728). A recent 
report (Hutvagner et al., 2002, Sciencexpress 297:2056-2060) hypothesizes that gene 
regulation through the miRNA pathway versus the siRNA pathway is determined solely by 
the degree of complementarity to the target transcript. It is speculated that siRNAs with only 
partial identity to the mRNA target will function in translational repression, similar to an 
miRNA, rather than triggering RNA degradation. 

It has also been shown that siRNA and shRNA can be used to silence genes in vivo. 
The ability to utilize siRNA and shRNA for gene silencing in vivo has the potential to enable 
selection and development of siRNAs for therapeutic use. A recent report highlights the 
potential therapeutic application of siRNAs. Fas-mediated apoptosis is implicated in a broad 
spectrum of liver diseases, where lives could be saved by inhibiting apoptotic death of 
hepatocytes. Song (Song et al., 2003, Nat. Medicine 9, 347-351) injected mice intravenously 
with siRNA targeted to the Fas receptor. The Fas gene was silenced in mouse hepatocytes at 
the mRNA and protein levels, which prevented apoptosis and protected the mice from 
hepatitis-induced liver damage. Thus, silencing Fas expression holds therapeutic promise to 
prevent liver injury by protecting hepatocytes from cytotoxicity. As another example, 
Sorensen et al. injected mice intraperitoneally with siRNA targeting TNF-α. 
Lipopolysaccharide-induced TNF-α gene expression was inhibited, and these mice were 
protected from sepsis. Collectively, these results suggest that siRNAs can function in vivo, 
and may hold potential as therapeutic drugs (Sorensen et al., 2003, J. Mol. Biol. 327, 761-766). 

Martinez et al. reported that RNA interference can be used to selectively target 
oncogenic mutations (Martinez et al., 2002, Proc. Natl. Acad. Sci. USA 99:14849-14854). In 
this report, an siRNA that targets the region of the R248W mutant of p53 containing the point 
mutation was shown to silence the expression of the mutant p53 but not the wild-type p53. 

Wilda et al. reported that an siRNA targeting the M-BCR/ABL fusion mRNA can be 
used to deplete the M-BCR/ABL mRNA and the M-BCR/ABL oncoprotein in leukemic cells 
(Wilda et al., 2002, Oncogene 21:5716-5724). However, the report also showed that 
applying the siRNA in combination with Imatinib, a small-molecule ABL kinase tyrosine 
inhibitor, to leukemic cells did not further increase the induction of apoptosis. 

U.S. Patent No. 6,506,559 discloses an RNA interference process for inhibiting 
expression of a target gene in a cell. The process comprises introducing partially or fully 
double-stranded RNA having a sequence in the duplex region that is identical to a sequence 
in the target gene into the cell or into the extracellular environment. RNA sequences with 
insertions, deletions, and single point mutations relative to the target sequence are also found 
to be effective for expression inhibition. 

U.S. Patent Application Publication No. US 2002/0086356 discloses RNA 
interference in a Drosophila in vitro system using RNA segments 21-23 nucleotides (nt) in 
length. The patent application publication teaches that when these 21-23 nt fragments are 
purified and added back to Drosophila extracts, they mediate sequence-specific RNA 
interference in the absence of long dsRNA. The patent application publication also teaches 
that chemically synthesized oligonucleotides of the same or similar nature can also be used to 
target specific mRNAs for degradation in mammalian cells. 

PCT publication WO 02/44321 discloses that double-stranded RNA (dsRNA) 19-23 
nt in length induces sequence-specific post-transcriptional gene silencing in a Drosophila in 
vitro system. The PCT publication teaches that short interfering RNAs (siRNAs) generated 
by an RNase III-like processing reaction from long dsRNA or chemically synthesized siRNA 
duplexes with overhanging 3' ends mediate efficient target RNA cleavage in the lysate, and 
the cleavage site is located near the center of the region spanned by the guiding siRNA. The 
PCT publication also provides evidence that the direction of dsRNA processing determines 
whether sense or antisense-identical target RNA can be cleaved by the produced siRNP 
complex. 

U.S. Patent Application Publication No. US 2002/016216 discloses a method for 
attenuating expression of a target gene in cultured cells by introducing double stranded RNA 
(dsRNA) that comprises a nucleotide sequence that hybridizes under stringent conditions to a 
nucleotide sequence of the target gene into the cells in an amount sufficient to attenuate 
expression of the target gene. 

PCT publication WO 03/006477 discloses engineered RNA precursors that when 
expressed in a cell are processed by the cell to produce targeted small interfering RNAs 
(siRNAs) that selectively silence targeted genes (by cleaving specific mRNAs) using the 
cell's own RNA interference (RNAi) pathway. The PCT publication teaches that by 
introducing nucleic acid molecules that encode these engineered RNA precursors into cells in 
vivo with appropriate regulatory sequences, expression of the engineered RNA precursors can 
be selectively controlled both temporally and spatially, i.e., at particular times and/or in 
particular tissues, organs, or cells. 

Elbashir et al. disclosed a systematic analysis of the length, secondary structure, sugar 
backbone and sequence specificity of siRNA for RNAi (Elbashir et al., 2001, EMBO J. 
20:6877-6888). Based on the analysis, Elbashir proposed rules for designing siRNAs. 

Aza-Blanc et al. reported correlations between silencing efficacy and GC content of 
the 5' and 3' regions of the 19 bp target sequence (Aza-Blanc et al., 2003, Mol. Cell 12:627- 
637). It was found that siRNAs targeting sequences with a GC rich 5' and GC poor 3' 
perform the best. 

Discussion or citation of a reference herein shall not be construed as an admission that 
such reference is prior art to the present invention. 

3. SUMMARY OF THE INVENTION 

In one aspect, the invention provides a method for selecting from a plurality of 
different siRNAs one or more siRNAs for silencing a target gene in an organism, each of the 
plurality of different siRNAs targeting a different target sequence in a transcript of the target 
gene, the method comprising (a) ranking the plurality of different siRNAs according to 
positional base compositions of corresponding targeted sequence motifs in the transcript, 
wherein each targeted sequence motif comprises at least a portion of the target sequence of 
the corresponding siRNA and/or a second sequence in a sequence region flanking the target 
sequence; and (b) selecting one or more siRNAs from the ranked siRNAs. In a preferred 
embodiment, each sequence motif comprises the target sequence of the targeting siRNA. In 
another embodiment, the ranking step is carried out by (a1) determining a score for each 
different siRNA, wherein the score is calculated using a position-specific score matrix; and 
(a2) ranking the plurality of different siRNAs according to the score. 

In one embodiment, each sequence motif is a nucleotide sequence of L nucleotides, L 
being an integer, and the position-specific score matrix is $\{\log(e_{ij}/p_{ij})\}$, where $e_{ij}$ is the weight 
of nucleotide i at position j, $p_{ij}$ is the weight of nucleotide i at position j in a random 
sequence, and i = G, C, A, U(T), j = 1, ..., L. In another embodiment, each sequence motif 
is a nucleotide sequence of L nucleotides, L being an integer, and the position-specific score 
matrix is $\{\log(e_{ij}/p_{ij})\}$, where $e_{ij}$ is the weight of nucleotide i at position j, $p_{ij}$ is the weight of 
nucleotide i at position j in a random sequence, and i = G or C, A, U(T), j = 1, ..., L. 

In one embodiment, the score for each siRNA is calculated according to equation 

$$\mathrm{Score} = \sum_{t=1}^{L} \ln(e_t / p_t)$$

wherein $e_t$ and $p_t$ are respectively weights of the nucleotide at position t in the sequence motif 
as determined according to the position-specific score matrix and in a random sequence. 
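
The scoring and ranking steps described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function names, the representation of the matrix as one dictionary of weights per position, and the uniform random-sequence weight of 0.25 are all assumptions made for illustration.

```python
import math

def pssm_score(motif, weights, background=0.25):
    """Sum ln(e_t / p_t) over the positions t of a sequence motif.

    weights[t][base] plays the role of e_t for the base found at position t;
    background plays the role of p_t (assumed uniform here).
    """
    motif = motif.upper().replace("T", "U")
    if len(motif) != len(weights):
        raise ValueError("motif length must equal the number of matrix positions")
    return sum(math.log(weights[t][base] / background)
               for t, base in enumerate(motif))

def rank_by_score(motifs, weights):
    """Return the motifs sorted from highest to lowest score."""
    return sorted(motifs, key=lambda m: pssm_score(m, weights), reverse=True)
```

A motif whose bases all carry the random-sequence weight scores zero; positions where the matrix weight exceeds the random-sequence weight contribute positively.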

In another embodiment, each sequence motif comprises the target sequence of the 
targeting siRNA and at least one flanking sequence. Preferably, each sequence motif 
comprises the target sequence of the targeting siRNA and a 5' flanking sequence and a 3' 
flanking sequence. In one embodiment, the 5' flanking sequence and the 3' flanking sequence 
are each a sequence of D nucleotides, D being an integer. In a specific embodiment, each 
target sequence is a sequence of 19 nucleotides, and each 5' flanking sequence and 3' flanking 
sequence are a sequence of 10 nucleotides. In another specific embodiment, each target 
sequence is a sequence of 19 nucleotides, and each 5' flanking sequence and 3' flanking 
sequence are a sequence of 50 nucleotides. 

Preferably, the one or more siRNAs consist of at least 3 siRNAs. In another 
embodiment, the method further comprises a step of de-overlapping, comprising selecting a 
plurality of siRNAs among the at least 3 siRNAs such that siRNAs in the plurality are 
sufficiently different in a sequence diversity measure. In one embodiment, the diversity 
measure is a quantifiable measure, and the selecting in the de-overlapping step comprises 
selecting siRNAs having a difference in the sequence diversity measure between different 
selected siRNAs above a given threshold. In one embodiment, the sequence diversity 
measure is the overall GC content of the siRNAs. In one embodiment, the given threshold is 
5%. In another embodiment, the sequence diversity measure is the distance between siRNAs 
along the length of the transcript sequence. In one embodiment, the threshold is 100 
nucleotides. In still another embodiment, the sequence diversity measure is the identity of the 
leading dimer of the siRNAs, wherein each of the 16 possible leading dimers is assigned a 
score of 1-16, respectively. In one embodiment, the threshold is 0.5. 
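
As an illustration of the de-overlapping step when the diversity measure is distance along the transcript, the following Python sketch may be helpful. The greedy, best-score-first strategy and the field names `start` and `score` are assumptions for illustration; the text above specifies only that selected siRNAs differ by more than a threshold.

```python
def de_overlap(candidates, min_distance=100):
    """Greedily keep candidates whose start positions differ by more than
    min_distance nucleotides from every candidate already kept.

    candidates: list of dicts with hypothetical keys "start" (position of the
    target sequence in the transcript) and "score" (ranking score).
    """
    selected = []
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if all(abs(cand["start"] - kept["start"]) > min_distance
               for kept in selected):
            selected.append(cand)
    return selected
```

The same loop could be applied to another quantifiable diversity measure, such as overall GC content, by replacing the distance comparison.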

In another embodiment, the method further comprises a step of selecting one or more 
siRNAs based on silencing specificity, the step of selecting based on silencing specificity 
comprising, (i) for each of the plurality of siRNAs, predicting off-target genes of the siRNA 
from among a plurality of genes, wherein the off-target genes are genes other than the target 
gene and are directly silenced by the siRNA; (ii) ranking the plurality of siRNAs according to 
their respective numbers of off-target genes; and (iii) selecting one or more siRNAs for which 
the number of off-target genes is below a given threshold. 

In one embodiment, the predicting comprises (i1) evaluating the sequence of each of 
the plurality of genes based on a predetermined siRNA sequence match pattern; and (i2) 
predicting the gene as an off-target gene if the gene comprises a sequence that matches the 
siRNA based on the sequence match pattern. In one embodiment, the step of evaluating 
comprises identifying an alignment of the siRNA to a sequence in a gene by a low stringency 
FastA alignment. 

In one embodiment, each siRNA has L nucleotides in its duplex region, and the match 
pattern is represented by a position match position-specific score matrix (pmPSSM), the 
position match position-specific score matrix consisting of weights of different positions in 
an siRNA to match transcript sequence positions in an off-target transcript $\{P_j\}$, where j = 1, ..., 
L, and $P_j$ is the weight of a match at position j. 

In another embodiment, the step (i1) comprises calculating a position match score 
pmScore according to equation 

$$\mathrm{pmScore} = \sum_{i=1}^{L} \ln(E_i / 0.25)$$

where $E_i = P_i$ if position i is a match and $E_i = (1 - P_i)/3$ if position i is a mismatch; and the step 
(i2) comprises predicting the gene as an off-target gene if the position match score is greater 
than a given threshold. 
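
The position match score can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the siRNA and the candidate transcript site are assumed to be supplied as equal-length strings in the same orientation so that a match is simple character equality, and the match weights passed in stand in for the values of Table I, which are not reproduced here.

```python
import math

def pm_score(sirna, site, match_weights):
    """Compute pmScore = sum over positions of ln(E_i / 0.25).

    E_i is P_i at a matching position and (1 - P_i) / 3 at a mismatch,
    where P_i = match_weights[i].
    """
    if not (len(sirna) == len(site) == len(match_weights)):
        raise ValueError("siRNA, site and weights must have the same length")
    total = 0.0
    for s, t, p in zip(sirna.upper(), site.upper(), match_weights):
        e = p if s == t else (1.0 - p) / 3.0
        total += math.log(e / 0.25)
    return total

def is_predicted_off_target(sirna, site, match_weights, threshold):
    """Flag the site when its position match score exceeds the threshold."""
    return pm_score(sirna, site, match_weights) > threshold
```

With every weight equal to 0.25, matches and mismatches both contribute zero, so the score carries information only where the weight of a match departs from the random expectation.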

In a preferred embodiment, L is 19, and the pmPSSM is given by Table I. 

Preferably, the plurality of genes comprises all known unique genes of the organism 
other than the target gene. 

In one embodiment, the position-specific score matrix (PSSM) is determined by a 
method comprising (aa) identifying a plurality of N siRNAs consisting of siRNAs having a 
19-nucleotide duplex region and having a silencing efficacy above a chosen threshold; (bb) 
identifying for each siRNA a functional sequence motif, the functional sequence motif 
comprising a 19-nucleotide target sequence of the siRNA and a 10-nucleotide 5' flanking 
sequence and a 10-nucleotide 3' flanking sequence; (cc) calculating a frequency matrix $\{f_{ij}\}$, 
where i = G, C, A, U(T); j = 1, 2, ..., L, and where $f_{ij}$ is the frequency of the ith nucleotide at 
the jth position, based on the siRNA functional sequence motifs according to equation 

$$f_{ij} = \sum_{k=1}^{N} \delta_{ik}(j)$$

where 

$$\delta_{ik}(j) = \begin{cases} 1, & \text{if } k = i \\ 0, & \text{if } k \neq i \end{cases}$$

and (dd) determining the PSSM by calculating $e_{ij}$ according to equation 

$$e_{ij} = \frac{f_{ij}}{N}$$

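The frequency-based construction of the matrix can be sketched in Python as below. The sketch assumes that all functional sequence motifs have the same length and use only the bases G, C, A and U (T is converted to U); the function name and the per-position dictionary layout are illustrative choices, not part of the described method.

```python
from collections import Counter

BASES = "GCAU"

def frequency_pssm(motifs):
    """Return, for each position j, the fraction of motifs carrying each base.

    The count of base i at position j corresponds to f_ij, and dividing by
    the number of motifs N gives e_ij.
    """
    motifs = [m.upper().replace("T", "U") for m in motifs]
    n = len(motifs)
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("all motifs must have the same length")
    pssm = []
    for j in range(length):
        counts = Counter(m[j] for m in motifs)
        pssm.append({base: counts[base] / n for base in BASES})
    return pssm
```

A base that never occurs at a position receives a weight of zero, for which a logarithmic score is undefined; how such cases are handled is not stated in the text above and is left open here.
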
In another embodiment, the position-specific score matrix (PSSM) is obtained by a 
method comprising (aa) initializing the PSSM with random weights; (bb) selecting randomly 
a weight $w_{ij}$ obtained in (aa); (cc) changing the value of the selected weight to generate a test 
psPSSM comprising the selected weight having the changed value; (dd) calculating a score 
for each of a plurality of siRNA functional sequence motifs using the test PSSM according 
to equation 

$$\mathrm{Score} = \sum_{k=1}^{L} \ln(w_k / p_k)$$

wherein $w_k$ and $p_k$ are respectively weights of a nucleotide at position k in the functional 
sequence motif and in a random sequence; (ee) calculating correlation of the score and a 
metric of a characteristic of an siRNA among the plurality of siRNA functional sequence 
motifs; (ff) repeating steps (cc)-(ee) for a plurality of different values of the selected weight 
in a given range and retaining the value that corresponds to the best correlation for the selected 
weight; and (gg) repeating steps (bb)-(ff) for a chosen number of times; thereby determining 
the PSSM. 
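
Steps (aa)-(gg) above can be sketched in Python as a simple random coordinate search. Several details are assumptions made only for illustration: Pearson correlation is used as the correlation measure, candidate weight values are taken from a fixed grid between 0.05 and 0.95, the random-sequence weight is a uniform 0.25, and motifs are assumed to be written with the bases G, C, A and U.

```python
import math
import random

BASES = "GCAU"

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def motif_score(motif, pssm, background=0.25):
    """Score = sum over positions k of ln(w_k / p_k)."""
    return sum(math.log(pssm[k][base] / background)
               for k, base in enumerate(motif))

def optimize_pssm(motifs, metric, rounds=100, seed=0):
    """Return (pssm, correlation) after the given number of rounds."""
    rng = random.Random(seed)
    length = len(motifs[0])
    grid = [0.05 * step for step in range(1, 20)]
    # (aa) initialize every weight at random
    pssm = [{base: rng.choice(grid) for base in BASES} for _ in range(length)]

    def correlation():
        return pearson([motif_score(m, pssm) for m in motifs], metric)

    for _ in range(rounds):                      # (gg) repeat
        k, base = rng.randrange(length), rng.choice(BASES)   # (bb)
        best_value, best_corr = pssm[k][base], correlation()
        for value in grid:                       # (cc)-(ff) try each value
            pssm[k][base] = value
            corr = correlation()
            if corr > best_corr:
                best_value, best_corr = value, corr
        pssm[k][base] = best_value               # keep the best value found
    return pssm, correlation()
```

Because a weight is changed only when the correlation improves, the correlation between score and metric never decreases from one round to the next in this sketch.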

In one embodiment, the method further comprises selecting the plurality of siRNA 
functional sequence motifs by a method comprising (i) identifying a plurality of siRNAs 
consisting of siRNAs having different values in the metric; (ii) identifying a plurality of 
siRNA functional sequence motifs each corresponding to an siRNA in the plurality of 
siRNAs. In a preferred embodiment, the characteristic is silencing efficacy. 

In one embodiment, the plurality of N siRNAs target a plurality of different genes 
having different transcript abundances in a cell. 

In one embodiment, step (b) is carried out by selecting one or more siRNAs having 
the highest scores. In another embodiment, step (b) is carried out by selecting one or more 
siRNAs having a score closest to a predetermined value, wherein the predetermined value is 
the score value corresponding to the maximum median silencing efficacy of a plurality of 
siRNA sequence motifs. In a preferred embodiment, the plurality of siRNA sequence motifs 
are sequence motifs in transcripts having an abundance level of less than about 3-5 copies per 
cell. 

In another embodiment, step (b) is carried out by selecting one or more siRNAs 
having a score within a predetermined range, wherein the predetermined range is a score 
range corresponding to a plurality of siRNA sequence motifs having a given level of 
silencing efficacy. In one embodiment, the silencing efficacy is above 50%, 75%, or 90% at 
an siRNA dose of about 100 nM. 

In a preferred embodiment, the plurality of siRNA sequence motifs are sequence 
motifs in transcripts having an abundance level of less than about 3-5 copies per cell. 

In another preferred embodiment, the plurality of N siRNAs comprises at least 10, 50, 
100, 200, or 500 different siRNAs. 

In another embodiment, the position-specific score matrix (PSSM) comprises $w_k$, k 
= 1, ..., L, $w_k$ being a difference in probability of finding nucleotide G or C at sequence 
position k between a first type of siRNA and a second type of siRNA, and the score for each 
strand is calculated according to equation 

$$\mathrm{Score} = \sum_{k=1}^{L} w_k.$$

In one embodiment, the first type of siRNA consists of one or more siRNAs having 
silencing efficacy no less than a first threshold and the second type of siRNA consists of one 
or more siRNAs having silencing efficacy less than a second threshold. 

In one embodiment, the difference in probability is described by a sum of Gaussian 
curves, each of the Gaussian curves representing the difference in probability of finding a G 
or C at a different sequence position. 

In one embodiment, the first and second threshold are both 75% at an siRNA dose of 
100 nM. 

In another aspect, the invention provides a method for selecting from a plurality of 
different siRNAs one or more siRNAs for silencing a target gene in an organism, each of the 
plurality of different siRNAs targeting a different target sequence in a transcript of the target 
gene, the method comprising (a) ranking the plurality of different siRNAs according to 
positional base composition of reverse complement sequences of sense strands of the 
siRNAs; and (b) selecting one or more siRNAs from the ranked siRNAs. 

In one embodiment, the ranking step is carried out by (a1) determining a score for 
each different siRNA, wherein the score is calculated using a position-specific score matrix; 
and (a2) ranking the plurality of different siRNAs according to the score. 

In one embodiment, the siRNA has a nucleotide sequence of L nucleotides in its 
duplex region, L being an integer, wherein the position-specific score matrix comprises $w_k$, k 
= 1, ..., L, $w_k$ being a difference in probability of finding nucleotide G or C at sequence 
position k between reverse complement of sense strand of a first type of siRNA and reverse 
complement of sense strand of a second type of siRNA, and the score for each reverse 
complement is calculated according to equation 

$$\mathrm{Score} = \sum_{k=1}^{L} w_k.$$

In one embodiment, the first type of siRNA consists of one or more siRNAs having 
silencing efficacy no less than a first threshold and the second type of siRNA consists of one 
or more siRNAs having silencing efficacy less than a second threshold. 

In another embodiment, the difference in probability is described by a sum of 
Gaussian curves, each of the Gaussian curves representing the difference in probability of 
finding a G or C at a different sequence position. 

In one embodiment, the first and second threshold are both 75% at an siRNA dose of 
100 nM. 

In still another aspect, the invention provides a method for selecting from a plurality 
of different siRNAs one or more siRNAs for silencing a target gene in an organism, each of 
the plurality of different siRNAs targeting a different target sequence in a transcript of the 
target gene, the method comprising, (i) for each of the plurality of different siRNAs, 
predicting off-target genes of the siRNA from among a plurality of genes, wherein the 
off-target genes are genes other than the target gene and are directly silenced by the siRNA; (ii) 
ranking the plurality of different siRNAs according to the number of off-target genes; and 
(iii) selecting one or more siRNAs for which the number of off-target genes is below a given 
threshold. 

In one embodiment, the predicting comprises (i1) evaluating the sequence of each of 
the plurality of genes based on a predetermined siRNA sequence match pattern; and (i2) 
predicting a gene as an off-target gene if the gene comprises a sequence that matches the 
siRNA based on the sequence match pattern. 

In one embodiment, each siRNA has L nucleotides in its duplex region, and the 
sequence match pattern is represented by a position match position-specific score matrix 
(pmPSSM), the position match position-specific score matrix consisting of weights of 
different positions in an siRNA to match transcript sequence positions in an off-target 
transcript $\{P_j\}$, where j = 1, ..., L, and $P_j$ is the weight of a match at position j. 

In another embodiment, the step (i1) comprises calculating a position match score 
pmScore according to equation 

$$\mathrm{pmScore} = \sum_{i=1}^{L} \ln(E_i / 0.25)$$

where $E_i = P_i$ if position i is a match and $E_i = (1 - P_i)/3$ if position i is a mismatch; and the step 
(i2) comprises predicting the gene as an off-target gene if the position match score is greater 
than a given threshold. 

In a preferred embodiment, L is 19, and the pmPSSM is given by Table I. 

In one embodiment, the plurality of genes comprises all known unique genes of the 
organism other than the target gene. 

In still another aspect, the invention provides a library of siRNAs, comprising a 
plurality of siRNAs for each of a plurality of different genes of an organism, wherein each 
siRNA achieves at least 75%, at least 80%, or at least 90% silencing of its target gene. In one 
embodiment, the plurality of siRNAs consists of at least 3, at least 5, or at least 10 siRNAs. 
In another embodiment, the plurality of different genes consists of at least 10, at least 100, at 
least 500, at least 1,000, at least 10,000, or at least 30,000 different genes. 

In still another aspect, the invention provides a method for determining a base 
composition position-specific score matrix (bsPSSM) $\{\log(e_{ij}/p_{ij})\}$ for representing base 
composition patterns of siRNA functional sequence motifs of L nucleotides in transcripts, 
wherein i = G, C, A, U(T) and j = 1, 2, ..., L, and each siRNA functional sequence motif 
comprises at least a portion of the target sequence of the corresponding targeting siRNA 
and/or a sequence in a sequence region flanking the target sequence, the method comprising 
(a) identifying a plurality of N different siRNAs consisting of siRNAs having a silencing 
efficacy above a chosen threshold; (b) identifying a plurality of corresponding siRNA 
functional sequence motifs, one for each different siRNA; (c) calculating a frequency matrix 
$\{f_{ij}\}$, where i = G, C, A, U(T); j = 1, 2, ..., L, and where $f_{ij}$ is the frequency of the ith 
nucleotide at the jth position, based on the plurality of N siRNA functional sequence motifs 
according to equation 

$$f_{ij} = \sum_{k=1}^{N} \delta_{ik}(j)$$

where 

$$\delta_{ik}(j) = \begin{cases} 1, & \text{if } k = i \\ 0, & \text{if } k \neq i \end{cases}$$

and (d) determining the psPSSM by calculating $e_{ij}$ according to equation 

$$e_{ij} = \frac{f_{ij}}{N}$$

In one embodiment, each siRNA functional motif comprises the target sequence of 
the corresponding targeting siRNA and one or both flanking sequences of the target 
sequence. 

In one embodiment, each siRNA has M nucleotides in its duplex region, and each 
siRNA functional sequence motif consists of an siRNA target sequence of M nucleotides, a 5' 
flanking sequence of $D_1$ nucleotides and a 3' flanking sequence of $D_2$ nucleotides. 

In a specific embodiment, each siRNA has 19 nucleotides in its duplex region, and 
each siRNA functional sequence motif consists of an siRNA target sequence of 19 
nucleotides, a 5' flanking sequence of 10 nucleotides and a 3' flanking sequence of 10 
nucleotides. In another specific embodiment, each siRNA has 19 nucleotides in its duplex 
region, and each siRNA functional sequence motif consists of an siRNA target sequence of 
19 nucleotides, a 5' flanking sequence of 50 nucleotides and a 3' flanking sequence of 50 
nucleotides. 

In one embodiment, the plurality of N siRNAs each targets a gene whose transcript 
abundance is within a given range. In one embodiment, the range is at least about 5, 10, or 
100 transcripts per cell. In another embodiment, the range is less than about 3-5 transcripts 
per cell. 

In another embodiment, the silencing threshold is 50%, 75%, or 90% at an siRNA 
dose of about 100 nM. In still another embodiment, the plurality of N siRNAs comprises 10, 
50, 100, 200, or 500 different siRNAs. 

In still another aspect, the invention provides a method for determining a base 
composition position-specific score matrix (bsPSSM) $\{w_{ij}\}$ for representing a base 
composition pattern representing a plurality of different siRNA functional sequence motifs of 
L nucleotides, wherein i = G, C, A, U(T) and j = 1, 2, ..., L, and each siRNA functional 
sequence motif comprises at least a portion of the target sequence of the corresponding 
targeting siRNA and/or a sequence in a sequence region flanking the siRNA target sequence, 
the method comprising (a) initializing the bsPSSM with random weights; (b) selecting 
randomly a weight $w_{ij}$ obtained in (a); (c) changing the value of the selected weight to 
generate a test psPSSM comprising the selected weight having the changed value; (d) 
calculating a score for each of the plurality of siRNA functional sequence motifs using the 
test psPSSM according to equation 

$$\mathrm{Score} = \sum_{k=1}^{L} \ln(w_k / p_k)$$

wherein $w_k$ and $p_k$ are respectively weights of a nucleotide at position k in the functional 
sequence motif and in a random sequence; (e) calculating correlation of the score and a 
metric characterizing an siRNA among the plurality of siRNA functional sequence motifs; 
(f) repeating steps (c)-(e) for a plurality of different values of the selected weight in a given 
range and retaining the value that corresponds to the best correlation for the selected weight; and 
(g) repeating steps (b)-(f) for a chosen number of times; thereby determining the psPSSM. 

The invention also provides a method for determining a base composition
position-specific score matrix (bsPSSM) {w_ij} for representing a base composition pattern
representing a plurality of different siRNA functional sequence motifs of L nucleotides,
wherein i = G/C, A, U(T) and j = 1, 2, ..., L, and each siRNA functional sequence motif
comprises at least a portion of the target sequence of the corresponding siRNA and/or a
sequence in a sequence region flanking the siRNA target sequence, the method comprising
(a) initializing the bsPSSM with random weights; (b) randomly selecting a weight w_ij
obtained in (a); (c) changing the value of the selected weight to generate a test bsPSSM
comprising the selected weight having the changed value; (d) calculating a score for each of
the plurality of siRNA functional sequence motifs using the test bsPSSM according to
equation

Score = \sum_{k=1}^{L} \ln(w_k / p_k)

wherein w_k and p_k are respectively weights of a nucleotide at position k in the functional
sequence motif and in a random sequence; (e) calculating a correlation of the score and a
metric of a characteristic of an siRNA among the plurality of siRNA functional sequence
motifs; (f) repeating steps (c)-(e) for a plurality of different values of the selected weight in a
given range and retaining the value that corresponds to the best correlation for the selected
weight; and (g) repeating steps (b)-(f) for a chosen number of times; thereby determining the
bsPSSM.
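
The scoring equation of step (d) can be illustrated with a minimal Python sketch. The row labels, the dictionary layout of the weights, the background weights for a random sequence (0.5 for G/C, 0.25 for A, 0.25 for U) and the toy weight values are all assumptions made for illustration; none of them is taken from this disclosure.

```python
import math

# Hypothetical row labels for a base-composition PSSM: G and C share one row.
ROW_OF = {"G": "GC", "C": "GC", "A": "A", "U": "U", "T": "U"}

# Assumed background weights p_k for a random sequence (equal base frequencies).
BACKGROUND = {"GC": 0.5, "A": 0.25, "U": 0.25}


def bs_pssm_score(motif, weights, background=BACKGROUND):
    """Return the sum over positions k of ln(w_k / p_k) for one sequence motif."""
    if any(len(row) != len(motif) for row in weights.values()):
        raise ValueError("each weight row needs one entry per motif position")
    score = 0.0
    for k, base in enumerate(motif.upper()):
        row = ROW_OF[base]
        score += math.log(weights[row][k] / background[row])
    return score


# Toy 4-nucleotide motif with made-up weights (illustration only).
toy_weights = {
    "GC": [0.50, 0.25, 0.50, 1.00],
    "A": [0.25, 0.50, 0.25, 0.25],
    "U": [0.25, 0.25, 0.25, 0.25],
}
toy_score = bs_pssm_score("GAUC", toy_weights)
```

In this toy case only the A at the second position and the C at the fourth position carry weights above background, so they alone contribute to the score.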

In one embodiment, each siRNA functional motif comprises the target sequence of
the corresponding targeting siRNA and one or both flanking sequences of the target
sequence.

In another embodiment, the method further comprises selecting the plurality of siRNA
functional sequence motifs by a method comprising (i) identifying a plurality of siRNAs
consisting of siRNAs having different values in the metric; (ii) identifying a plurality of
siRNA functional sequence motifs each corresponding to an siRNA in the plurality of
siRNAs.

In one embodiment, each siRNA has M nucleotides in its duplex region, and each 
siRNA functional sequence motif consists of an siRNA target sequence of M nucleotides, a 5' 
flanking sequence of D1 nucleotides and a 3' flanking sequence of D2 nucleotides.

In a specific embodiment, each siRNA has 19 nucleotides in its duplex region, and 
each siRNA functional sequence motif consists of an siRNA target sequence of 19
nucleotides, a 5' flanking sequence of 10 nucleotides and a 3' flanking sequence of 10
nucleotides. In another specific embodiment, each siRNA has 19 nucleotides in its duplex
region, and each siRNA functional sequence motif consists of an siRNA target sequence of
19 nucleotides, a 5' flanking sequence of 50 nucleotides and a 3' flanking sequence of 50
nucleotides.

In one embodiment, the metric is silencing efficacy.

In one embodiment, the plurality of N siRNAs each targets a gene whose transcript 
abundance is within a given range. In one embodiment, the range is at least about 5, 10, or 
100 transcripts per cell. In another embodiment, the range is less than about 3-5 transcripts
per cell. In another embodiment, the threshold is 50%, 75%, or 90% at an siRNA dose of
about 100 nM.

In another embodiment, the method further comprises evaluating the psPSSM using
an ROC (receiver operating characteristic) curve of the sensitivity of the psPSSM vs. the
non-specificity of the psPSSM, the sensitivity of the psPSSM being the proportion of true
positives detected using the psPSSM as a fraction of total true positives, and the
non-specificity of the psPSSM being the proportion of false positives detected using the
psPSSM as a fraction of total false positives.
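
As an illustration of the ROC evaluation described above, the following Python sketch ranks siRNAs by score and records sensitivity and non-specificity as more top-ranked siRNAs are selected. The function names, the toy scores and labels, and the use of the trapezoidal area under the curve as a summary number are assumptions for illustration; tied scores are not given any special treatment here.

```python
def roc_points(scores, is_good):
    """Return (non_specificity, sensitivity) pairs as more top-ranked siRNAs are selected."""
    ranked = sorted(zip(scores, is_good), key=lambda pair: pair[0], reverse=True)
    total_good = sum(1 for _, good in ranked if good)
    total_bad = len(ranked) - total_good
    if total_good == 0 or total_bad == 0:
        raise ValueError("need at least one good and one bad siRNA")
    points = [(0.0, 0.0)]
    hits = misses = 0
    for _, good in ranked:
        if good:
            hits += 1      # a true positive was selected
        else:
            misses += 1    # a false positive was selected
        points.append((misses / total_bad, hits / total_good))
    return points


def area_under_curve(points):
    """Trapezoidal area under the curve; 0.5 corresponds to the 45-degree line."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up PSSM scores and good/bad labels (illustration only).
toy_scores = [3.1, 2.4, 1.9, 0.7, -0.2, -1.5]
toy_labels = [True, True, False, True, False, False]
toy_points = roc_points(toy_scores, toy_labels)
toy_auc = area_under_curve(toy_points)
```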

In one embodiment, the plurality of siRNA functional sequence motifs consists of at
least 50, at least 100, or at least 200 different siRNA functional sequence motifs.

In still another embodiment, the method further comprises testing the psPSSM using
another plurality of siRNA functional sequence motifs.



The invention also provides a method for determining a position match position-specific
score matrix (pmPSSM) {E_i} for representing a position match pattern of an siRNA of
L nucleotides with its target sequence in a transcript, wherein E_i is a score of a match at
position i, i = 1, 2, ..., L, the method comprising (a) identifying a plurality of N siRNA
off-target sequences, wherein each off-target sequence is a sequence on which the siRNA
exhibits silencing activity; (b) calculating a position match weight matrix {P_i}, where i = 1,
2, ..., L, based on the plurality of N siRNA off-target sequences according to equation

P_i = \frac{1}{N} \sum_{k=1}^{N} \delta_k(i)

where \delta_k(i) is 1 if position i of the kth off-target sequence is a match, and is 0 if it is
a mismatch; and (c) determining the pmPSSM by calculating E_i such that E_i = P_i if
position i is a match and E_i = (1 - P_i)/3 if position i is a mismatch.
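
Steps (b) and (c) can be sketched in Python as follows. The representation of each off-target alignment as a list of 1 (match) and 0 (mismatch) values, the function names and the toy data are assumptions for illustration. How the per-position scores E_i are combined into a single alignment score is not specified in this passage, so the sketch stops at the per-position values.

```python
def position_match_weights(match_rows):
    """P_i: fraction of the N off-target alignments that match at position i."""
    if not match_rows:
        raise ValueError("need at least one off-target alignment")
    length = len(match_rows[0])
    if any(len(row) != length for row in match_rows):
        raise ValueError("all alignments must have the same length")
    n = len(match_rows)
    return [sum(row[i] for row in match_rows) / n for i in range(length)]


def pm_pssm_scores(sirna, candidate, p):
    """E_i per aligned position: P_i on a match, (1 - P_i) / 3 on a mismatch."""
    if not (len(sirna) == len(candidate) == len(p)):
        raise ValueError("siRNA, candidate and weights must have equal length")
    return [p_i if a == b else (1.0 - p_i) / 3.0
            for a, b, p_i in zip(sirna, candidate, p)]


# Made-up match/mismatch patterns for N = 4 off-target alignments of length 4.
toy_rows = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]
toy_p = position_match_weights(toy_rows)
toy_e = pm_pssm_scores("GAUC", "GAAC", toy_p)
```

The divisor 3 in the mismatch case spreads the remaining probability 1 - P_i over the three nucleotides that differ from the siRNA base at that position.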

In a preferred embodiment, L = 19. In another preferred embodiment, the position
match weight matrix is given by Table I.

The invention also provides a method for evaluating the relative activity of the two
strands of an siRNA in off-target gene silencing, comprising comparing position specific base
composition of the sense strand of the siRNA and position specific base composition of the
antisense strand of the siRNA or reverse complement strand of the sense strand of the siRNA,
wherein the antisense strand is the guiding strand for targeting the intended target sequence.

In one embodiment, the comparing is carried out by a method comprising (a) 
determining a score for the sense strand of the siRNA, wherein the score is calculated using a 
position-specific score matrix; (b) determining a score for the antisense strand of the siRNA 
or the reverse complement strand of the sense strand of the siRNA using the position-specific 
score matrix; and (c) comparing the score for the sense strand and the score for the antisense 
strand or the reverse complement strand of the sense strand, thereby evaluating strand 
preference of the siRNA.

In one embodiment, the siRNA has a nucleotide sequence of L nucleotides in its
duplex region, L being an integer, wherein the position-specific score matrix is {w_ij}, where
w_ij is the weight of nucleotide i at position j, i = G, C, A, U(T), j = 1, ..., L.

In another embodiment, the siRNA has a nucleotide sequence of L nucleotides in its
duplex region, L being an integer, and the position-specific score matrix is {w_ij}, where w_ij is
the weight of nucleotide i at position j, i = G or C, A, U(T), j = 1, ..., L.

In another embodiment, the position-specific score matrix is obtained by a method
comprising (a) initializing the position-specific score matrix with random weights; (b)
selecting randomly a weight w_ij obtained in (a); (c) changing the value of the selected weight
to generate a test position-specific score matrix comprising the selected weight having the
changed value; (d) calculating a score for each of a plurality of siRNAs using the test
position-specific score matrix according to equation

Score = \sum_{j=1}^{L} \ln(w_j / p_j)

wherein w_j and p_j are respectively weights of a nucleotide at position j in the siRNA and
in a random sequence; (e) calculating a correlation of the score with a metric of a characteristic
of an siRNA among the plurality of siRNAs; (f) repeating steps (c)-(e) for a plurality of
different values of the selected weight in a given range and retaining the value that corresponds
to the best correlation for the selected weight; and (g) repeating steps (b)-(f) for a chosen
number of times; thereby determining the position-specific score matrix.
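
Steps (a)-(g) amount to a random, one-weight-at-a-time search. A minimal Python sketch is given below, assuming a uniform background weight of 0.25, a small fixed grid of candidate weight values, the Pearson coefficient as the correlation measure, and "best" meaning the highest coefficient. These choices, the function names and the toy data are illustrative assumptions, not details taken from this disclosure.

```python
import math
import random

BASES = "GCAU"


def score(seq, w, p=0.25):
    """Sum over positions j of ln(w_j / p_j), with an assumed uniform background p."""
    return sum(math.log(w[base][j] / p) for j, base in enumerate(seq))


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def optimize_pssm(seqs, metric, rounds=200,
                  candidates=(0.05, 0.15, 0.25, 0.35, 0.45), seed=0):
    rng = random.Random(seed)
    length = len(seqs[0])
    # (a) initialize the matrix with random weights
    w = {b: [rng.choice(candidates) for _ in range(length)] for b in BASES}

    def fitness():
        # (d)-(e) score every siRNA, then correlate the scores with the metric
        return pearson([score(s, w) for s in seqs], metric)

    best = fitness()
    history = [best]
    for _ in range(rounds):                      # (g) repeat a chosen number of times
        base, j = rng.choice(BASES), rng.randrange(length)  # (b) pick one weight
        best_value = w[base][j]
        for value in candidates:                 # (c), (f) try values in a range
            w[base][j] = value
            r = fitness()
            if r > best:
                best, best_value = r, value
        w[base][j] = best_value                  # retain the best value found
        history.append(best)
    return w, history


# Made-up sequences and silencing values (illustration only).
toy_seqs = ["GAUC", "GGCC", "AUAU", "UAAU", "GCAU", "AUGC"]
toy_metric = [0.60, 0.20, 0.90, 0.95, 0.50, 0.55]
toy_w, toy_history = optimize_pssm(toy_seqs, toy_metric, rounds=50)
```

Because a changed weight is kept only when it improves the correlation, the recorded correlation never decreases from one round to the next.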

In one embodiment, the metric is siRNA silencing efficiency.

In one embodiment, the siRNA has 19 nucleotides in its duplex region.

In another embodiment, the siRNA has a nucleotide sequence of L nucleotides in its
duplex region, L being an integer, wherein the position-specific score matrix comprises w_k, k
= 1, ..., L, w_k being a difference in probability of finding nucleotide G or C at sequence
position k between a first type of siRNA and a second type of siRNA, and the score for each
strand is calculated according to equation

Score = \sum_{k=1}^{L} w_k.

In one embodiment, the first type of siRNA consists of one or more siRNAs having
silencing efficacy no less than a first threshold and the second type of siRNA consists of one
or more siRNAs having silencing efficacy less than a second threshold, and the siRNA is
determined as having antisense preference if the score determined in step (a) is greater than
the score determined in step (b), or as having sense preference if the score determined in step
(b) is greater than the score determined in step (a).
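
The comparison of steps (a)-(c) can be sketched in Python as below. The passage does not spell out how the strand sequence enters the sum of w_k; the sketch assumes that w_k is added only at positions where the strand carries a G or C. That reading, the function names and the toy weights are assumptions for illustration.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Reverse complement of an RNA strand written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def gc_score(strand, w):
    """Assumed reading of Score = sum of w_k: count w_k where the strand has G or C."""
    if len(strand) != len(w):
        raise ValueError("strand and weights must have equal length")
    return sum(w_k for base, w_k in zip(strand, w) if base in "GC")


def strand_preference(sense, w):
    """Compare the sense-strand score (a) with the reverse-complement score (b)."""
    sense_score = gc_score(sense, w)
    antisense_score = gc_score(reverse_complement(sense), w)
    if sense_score > antisense_score:
        return "antisense"   # score (a) > score (b): antisense preference
    if antisense_score > sense_score:
        return "sense"       # score (b) > score (a): sense preference
    return "none"


# Made-up differences in G/C probability between the two siRNA types.
toy_w = [0.2, 0.1, -0.1, -0.3]
toy_call = strand_preference("GCAU", toy_w)
```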

In another embodiment, the difference in probability is described by a sum of
Gaussian curves, each of the Gaussian curves representing the difference in probability of
finding a G or C at a different sequence position.

In one embodiment, the first and second thresholds are both 75% at an siRNA dose of
about 100 nM.

In still another aspect, the invention provides a computer system comprising a
processor, and a memory coupled to the processor and encoding one or more programs,
wherein the one or more programs cause the processor to carry out any one of the methods of
the invention.

In still another aspect, the invention provides a computer program product for use in
conjunction with a computer having a processor and a memory connected to the processor,
the computer program product comprising a computer readable storage medium having a
computer program mechanism encoded thereon, wherein the computer program mechanism
may be loaded into the memory of the computer and cause the computer to carry out any one
of the methods of the invention.

4. BRIEF DESCRIPTION OF FIGURES 

FIGS. 1A-C show that base composition in and around an siRNA target sequence
affects the silencing efficacy of the siRNA. A total of 377 siRNAs were tested by Taqman
analysis for their ability to silence their target sequences 24hr following transfection into
HeLa cells. Median target silencing was ~75%. This dataset was divided into two subsets,
one having less than median and one having equal to or greater than median silencing ability
(referred to as "bad" and "good" siRNAs, respectively). Shown here are the mean difference
within a window of 5 (i.e., averaged over all 5 bases) in GC content (FIG. 1A), A content
(FIG. 1B), and U content (FIG. 1C) between good and bad siRNAs at different relative
positions on a target sequence.

FIGS. 2A-C (A) GC content of good and bad siRNAs; (B) A content of good and bad
siRNAs; (C) U content of good and bad siRNAs. The figures show average compositions of
each base. For example, 0.5 on the y-axis corresponds to an average base content of 50%.

FIG. 3 shows the performance of an actual siRNA base composition model used in
the siRNA design method of the invention. siRNA efficacy data were subdivided into two
pairs of training and test sets. Different PSSMs were optimized on each of the training sets
and verified on the test sets. The performance of each PSSM was evaluated by its ability to
distinguish good siRNAs (true positives) and bad siRNAs (false positives) as an increasing
number of siRNAs were selected from a list ranked by PSSM score. Shown are Receiver
Operating Characteristics (ROC) curves demonstrating the performance of two different
PSSMs on their respective training and test sets (heavy black and dotted gray lines,
respectively). The expected performance of the PSSMs on randomized data is shown for
comparison (i.e., no improvement in selection ability, 45° line).

FIG. 4 demonstrates the predictive ability of PSSMs on an independent experimental
data set. New siRNAs were designed for five genes by the standard method as described in
Elbashir et al., 2001, Nature 411:494-8, with the addition of the specificity prediction method
disclosed in this application, and by the PSSM based efficacy and specificity prediction
method of the invention. The top three ranked siRNAs per gene were selected for each
method and purchased from Dharmacon. All six siRNAs for each of the five genes were then
tested for their ability to silence their target sequences. Shown is a histogram of the number
of siRNAs that silence their respective target genes by a specified amount. Solid curve,
silencing by siRNAs designed by the present method; dashed curve, silencing by siRNAs
designed by the standard method; dotted gray curve, silencing by the data set of 377 siRNAs.

FIGS. 5A-C show mean weights of GC, A or U from the two ensembles of base
composition PSSM trained and tested with siRNAs in set 1 and set 2, respectively. FIG. 5A
mean weights for GC, FIG. 5B mean weights for A, FIG. 5C mean weights for U. siRNAs in
set 1 and set 2 are shown in Table II.

FIG. 6 shows an example of alignments of transcripts of off-target genes to the core
19mer of an siRNA oligo sequence. Off-target genes were selected from the Human 25k
v2.2.1 microarray by selecting for kinetic patterns of transcript abundance consistent with
direct effects of siRNA oligos. The left hand column lists transcript sequence identifiers.
Alignments were generated with FASTA and edited by hand. The black boxes and grey area
demonstrate the higher level of sequence similarity in the 3' half of the alignment.

FIG. 7 shows a position match position-specific scoring matrix for predicting
off-target effects. The chart shows the weight associated with each position in a matrix
representing the alignment between an siRNA oligo and off-target transcripts. The weight
represents the probability that a match will be observed at each position i along an alignment
between an siRNA oligo and an observed off-target transcript.

FIG. 8 shows optimization of the threshold score for predicting off-target effects of
siRNAs. The R^2 values result from the correlation of number of alignments scoring above
the threshold with number of observed off-target effects.

FIG. 9 shows a flow chart of an exemplary embodiment of the method for selecting 
siRNAs for use in silencing a gene. 

FIG. 10 illustrates sequence regions that can be used for distinguishing good and bad
siRNAs. PSSMs were trained on chunks of sequence 10+ bases in length, from 50 bases
upstream to 50 bases downstream of the siRNA 19mer, and tested on independent test sets.
The performance of models trained on chunks of interest was compared with models trained
on random sequences. Position 1 corresponds to the first 5' base in the duplex region of a
21 nt siRNA.

FIGS. 11A-B show curve models for PSSM. 11A: an exemplary set of curve models
for PSSM. 11B: the performance of the models on training and test sets.

FIG. 12 illustrates an exemplary embodiment of a computer system useful for
implementing the methods of the present invention.

FIG. 13 shows a comparison of the distribution of silencing efficacies of the siRNAs
among the 30 siRNAs designed using the method of the invention (solid circles) and siRNAs
designed using the standard method (open circles). x-axis: 1, KIF14; 2, PLK; 3, IGF1R; 4,
MAPK14; 5, KIF11. y-axis: RNA level. The siRNAs designed using the standard method to
the 5 genes exhibited a broad distribution of silencing abilities, while those designed with the
method of the invention show more consistent silencing within each gene, as well as across
genes. A narrow distribution is very important for functional genomics with siRNAs.



FIGS. 14A-B show a comparison of the GC content of siRNAs and their reverse
complements with the GC content of bad siRNAs. The results indicate that bad siRNAs have
sense strands similar to good siRNAs, while good siRNAs have sense strands similar to bad
siRNAs. RC: reverse complement of the siRNA target sequence.

FIG. 15 shows that less effective siRNAs have active sense strands. Strand bias of 61
siRNAs was predicted from expression profiles by the 3'-biased method, and from
comparison of the GC PSSM scores of the siRNAs and their reverse complements. Strand
bias predictions were binned by siRNA silencing efficacy.

FIG. 16 shows that silencing efficacy relates to transcript expression level. A total of
222 siRNAs (3 siRNAs per gene for 74 genes) were tested by bDNA or Taqman analysis for
their ability to silence their target sequences 24hr following transfection into HeLa cells.
Percent silencing (y-axis) was plotted as a function of transcript abundance (x-axis) measured
as intensity on microarray. Shown is the median target silencing observed for 3 siRNAs per
gene selected by the previous siRNA design algorithm. The dependence of silencing on gene
expression level, as the average of intensities from 2 array types, is shown for 74 genes.
TaqMan assays were used for 8 genes. bDNA data is shown for the remaining 66 genes.

FIG. 17 shows that the silencing efficacy of an siRNA relates to its base composition.
siRNAs to poorly-expressed genes were tested by bDNA analysis for their ability to silence
their target sequences. Data were divided into subsets having less than 75% silencing and
equal to or greater than 75% silencing (bad and good siRNAs, respectively). Shown here is
the difference in GC content between good and bad siRNAs (y-axis) at each position in the
siRNA sense strand (x-axis). The dataset includes both poorly-expressed and highly-expressed
genes from 570 siRNAs selected to 33 poorly- and 41 highly-expressed genes by
Tuschl rules or randomized selection. The siRNA sequences are listed in Table IV. The GC
profile for good siRNAs to poorly-expressed genes (gray dotted curve) shows some similar
composition preferences to good siRNAs for well-expressed genes (black curve), but also
some differences.

FIG. 18 shows the efficacy of newly designed siRNAs. siRNAs were designed for 18
poorly-expressed genes by the standard method and by the new algorithm. Standard pipeline:
selection for maximum pssm score; minimax filter for long off-target matches. Improved
pipeline: selection for 1-3 G+C in sense 19mer bases 2-7, base 1 & 19 asymmetry, -300 <
pssm score < +200, and blast matches less than 16, 200 bases on either side of the 19mer are
not repeat or low-complexity sequences. The top three ranked siRNAs per gene were selected
for each method. All six siRNAs for each of the five genes were then tested for their ability
to silence their target sequences. Shown is a histogram of the number of siRNAs silencing
their target genes by a specified amount. Dotted curve, silencing by siRNAs designed by the
new algorithm; solid curve, silencing by siRNAs designed by the standard method. Median
silencing improved from 60% (standard algorithm) to 80% (new algorithm).

FIG. 19. Design features of efficacious siRNAs. Studies of design criteria that
correlate with siRNA silencing efficacy have revealed a number of features that predict
efficacy. These include a base asymmetry at the two termini to direct the antisense (guide)
strand into RISC, a U at position 10 for effective cleavage of the transcript, a low GC stretch
encompassing the center and 3' end of the guide strand for enhanced cleavage, and the "seed"
region at the 5' end of the antisense strand implicated in transcript binding. Gray lines above
the duplex indicate sequence preferences, light gray lines below the duplex indicate
functional attributes.

FIG. 20 shows expression vs. median silencing in 371 siRNAs. These are siRNAs
from the original training set of 377 siRNAs. 6 siRNAs were not included in the analysis, as
the expression level of their target gene was not available.

5. DETAILED DESCRIPTION OF THE INVENTION

The present invention provides a method for identifying siRNA target motifs in a
transcript using a position-specific score matrix approach. The invention also provides a
method for identifying off-target genes of an siRNA and for predicting specificity of an
siRNA using a position-specific score matrix approach. The invention further provides a
method for designing siRNAs with higher silencing efficacy and specificity. The invention
also provides a library of siRNAs comprising siRNAs with high silencing efficacy and
specificity.

In this application, an siRNA is often said to target a gene. It will be understood that
when such a statement is made, it means that the siRNA is designed to target and cause
degradation of a transcript of the gene. Such a gene is also referred to as a target gene of the
siRNA, and the sequence in the transcript that is acted upon by the siRNA is referred to as the
target sequence. For example, a 19-nucleotide sequence in a transcript which is identical to
the 19-nucleotide sequence in the sense strand of the duplex region of an
siRNA is the target sequence of the siRNA. The antisense strand of the siRNA, i.e., the
strand that acts upon the target sequence, is also referred to as the guiding strand. In the
above example, the antisense strand of the 19-nucleotide duplex region of the siRNA is the
guiding strand. In this application, features of an siRNA are often referred to with reference
to its sequence, e.g., positional base composition. It will be understood that, unless
specifically pointed out otherwise, such a reference is made to the sequence of the sense
strand of the siRNA. In this application, a nucleotide or a sequence of nucleotides in an
siRNA is often described with reference to the 5' or 3' end of the siRNA. It will be
understood that when such a description is employed, it refers to the 5' or 3' end of the sense
strand of the siRNA. It will also be understood that, when a reference to the 3' end of the
siRNA is made, it refers to the 3' duplex region of the siRNA, i.e., the two nucleotides of the
3' overhang are not included in the numbering of the nucleotides. In the application, an
siRNA is also referred to as an oligo.

In this disclosure, design of siRNA is discussed in reference to silencing a sense
strand target, i.e., transcript target sequence corresponding to the sense strand of the siRNA.
It will be understood by a person skilled in the art that the methods of the invention are
also applicable to the design of siRNA for silencing an antisense target (see, e.g., Martinez et
al., 2002, Cell 110:563-574).

5.1. METHODS OF IDENTIFYING SEQUENCE MOTIFS IN A GENE FOR TARGETING
BY A SMALL INTERFERING RNA

The invention provides a method of identifying a sequence motif in a transcript which
may be targeted by an siRNA for degradation of the transcript, e.g., a sequence motif that is
likely to be a highly effective siRNA targeting site. Such a sequence motif is also referred to
as an siRNA susceptible motif. The method can also be used for identifying a sequence motif
in a transcript which may be less desirable for targeting by an siRNA, e.g., a sequence motif
that is likely to be a less effective siRNA targeting site. Such a sequence motif is also
referred to as an siRNA resistant motif.

In one embodiment, sequence features characteristic of a functional sequence motif,
e.g., an siRNA susceptible sequence motif, are identified and a profile of the functional motif
is built using, e.g., a library of siRNAs for which silencing efficacy has been determined.

In one embodiment, the sequence region of interest is scanned to identify sequences
that match the profile of the functional motif.
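
Scanning a sequence region against a profile can be sketched in Python as a sliding window. The sketch assumes the profile is a position-specific weight table scored as a sum of ln(w/p) with a uniform background of 0.25; the function name, the three-position toy profile and the toy transcript are made up for illustration.

```python
import math


def scan_transcript(transcript, w, p=0.25):
    """Score every window of the profile length and return hits, best first."""
    length = len(next(iter(w.values())))
    hits = []
    for start in range(len(transcript) - length + 1):
        window = transcript[start:start + length]
        s = sum(math.log(w[base][k] / p) for k, base in enumerate(window))
        hits.append((s, start, window))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits


# Made-up 3-position profile that favours the motif "GAU" (illustration only).
toy_profile = {
    "G": [0.7, 0.1, 0.1],
    "A": [0.1, 0.7, 0.1],
    "U": [0.1, 0.1, 0.7],
    "C": [0.1, 0.1, 0.1],
}
toy_hits = scan_transcript("CCGAUCC", toy_profile)
toy_best = toy_hits[0]
```

Here `start` is a 0-based offset into the scanned region, so the best hit in the toy example is the window beginning at offset 2.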

5.1.1. SEQUENCE PROFILE AND TARGET SILENCING EFFICACY

In a preferred embodiment, the profile of a functional sequence motif is represented
using a position-specific score matrix (PSSM). A general discussion of PSSM can be found
in, e.g., "Biological Sequence Analysis" by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison,
Cambridge Univ. Press, 1998; and Henikoff et al., 1994, J Mol Biol. 243:574-8. A PSSM is
a sequence motif descriptor which captures the characteristics of a functional sequence motif.
In this disclosure, a PSSM is used to describe sequence motifs of the invention, e.g., a
susceptible or resistant motif. A PSSM of an siRNA susceptible (resistant) motif is also
referred to as a susceptible (resistant) PSSM. A skilled person in the art will know that a
position-specific score matrix is also termed a position specific scoring matrix, a position
weight matrix (PWM), or a Profile.

In the present invention, a functional motif can comprise one or more sequences in an
siRNA target sequence. For example, the one or more sequences in an siRNA target
sequence may be a sequence at 5' end of the target sequence or a sequence at 3' end of the
target sequence. The one or more sequences in an siRNA target sequence may also be two
stretches of sequences, one at 5' end of the target sequence and one at 3' end of the target
sequence. A functional motif can also comprise one or more sequences in a sequence region
that flanks the siRNA target sequence. Such one or more sequences can be directly adjacent
to the siRNA target sequence. Such one or more sequences can also be separated from the
siRNA target sequence by an intervening sequence. FIG. 10 illustrates some examples of
functional motifs.

In one embodiment, a functional sequence motif, e.g., a susceptible or resistant
sequence motif, comprises at least a portion of a sequence targeted by an siRNA. In one
embodiment, the functional motif comprises a contiguous stretch of at least 7 nucleotides of
the target sequence. In a preferred embodiment, the contiguous stretch is in a 3' region of the
target sequence, e.g., beginning within 3 bases at the 3' end. In another embodiment, the
contiguous stretch is in a 5' region of the target sequence. In another embodiment, the
functional motif comprises a contiguous stretch of at least 3, 4, 5, 6, or 7 nucleotides in a 3'
region of the target sequence and comprises a contiguous stretch of at least 3, 4, 5, 6, or 7
nucleotides in a 5' region of the target sequence. In still another embodiment, the functional
motif comprises a contiguous stretch of at least 11 nucleotides in a central region of the target
sequence. Sequence motifs comprising less than the full length of the siRNA target sequence can
be used for evaluating siRNA target transcripts that exhibit only partial sequence identity to
an siRNA (International application No. PCT/US2004/015439 by Jackson et al., filed on May
17, 2004, which is incorporated herein by reference in its entirety). In a preferred
embodiment, the functional motif comprises the full length siRNA target sequence.

The functional motif may also comprise a flanking sequence. The inventors have
discovered that the sequence of such flanking region plays a role in determining the efficacy
of silencing. In one embodiment, a functional sequence motif, e.g., a susceptible or resistant
sequence motif, comprises at least a portion of a sequence targeted by an siRNA and one or
more sequences in one or both flanking regions. Thus, a sequence motif can include an M
nucleotides siRNA target sequence, a flanking sequence of D1 nucleotides at one side of the
siRNA target sequence and a flanking sequence of D2 nucleotides at the other side of the
siRNA target sequence where M, D1 and D2 are appropriate integers. In one embodiment, D1
= D2 = D. In one embodiment, M = 19. In some preferred embodiments, D1, D2, or D is at
least 5, 10, 20, 30, or 50 nucleotides in length. In a specific embodiment, a susceptible or
resistant sequence motif consists of an siRNA target sequence of 19 nucleotides and a
flanking sequence of 10 nucleotides at either side of the siRNA target sequence. In another
specific embodiment, a susceptible or resistant sequence motif consists of a 19 nucleotides
siRNA target sequence and a 50 nucleotides flanking sequence at either side of the siRNA
target sequence.
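
Extracting such a motif from a transcript can be sketched in Python as below. The function name, the 0-based `target_start` convention and the choice to return `None` when a flank would run past either end of the transcript are assumptions for illustration.

```python
def extract_motif(transcript, target_start, m=19, d1=10, d2=10):
    """Return the D1-nt 5' flank, M-nt target and D2-nt 3' flank as one motif.

    target_start is the 0-based offset of the first target nucleotide.
    Returns None when either flank would extend beyond the transcript.
    """
    left = target_start - d1
    right = target_start + m + d2
    if left < 0 or right > len(transcript):
        return None
    return transcript[left:right]


# Made-up transcript: 10-nt flank, 19-nt target, 10-nt flank, 5 extra bases.
toy_transcript = "A" * 10 + "G" * 19 + "C" * 10 + "U" * 5
toy_motif = extract_motif(toy_transcript, 10)
```

With M = 19 and D1 = D2 = 10 the returned motif has L = 39 nucleotides; with D1 = D2 = 50 it would have L = 119.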

In another embodiment, a sequence motif can include an M nucleotides siRNA target
sequence, and one or more of the following: a contiguous stretch of D1 nucleotides flanking
the 5' end of the target sequence, a contiguous stretch of D2 nucleotides flanking the 3' end of
the target sequence, a contiguous stretch of D3 nucleotides which starts about 35 nucleotides
upstream of the 5' end of the target sequence, a contiguous stretch of D4 nucleotides which
starts about 25 nucleotides downstream of the 3' end of the target sequence, and a contiguous
stretch of D5 nucleotides which starts about 60 nucleotides downstream of the 3' end of the
target sequence, where D1, D2, D3, D4, and D5 are appropriate integers. In one embodiment,
D1 = D2 = D. In some preferred embodiments, each of D1, D2, D3, D4, and D5 is at least 5, 10,
or 20 nucleotides in length. The length of the functional motif is L = M + D1 + D2 + D3 +
D4 + D5. In a specific embodiment, the sequence motif includes a 19 nucleotides siRNA target
sequence, a contiguous stretch of about 10 nucleotides flanking the 5' end of the target
sequence, a contiguous stretch of about 10 nucleotides flanking the 3' end of the target
sequence, a contiguous stretch of about 10 nucleotides which starts about 35 nucleotides
upstream of the 5' end of the target sequence, a contiguous stretch of about 10 nucleotides
which starts about 25 nucleotides downstream of the 3' end of the target sequence, and a
contiguous stretch of about 10 nucleotides which starts about 60 nucleotides downstream of
the 3' end of the target sequence (see FIG. 10).

In other embodiments, a functional sequence motif, e.g., a susceptible or resistant
sequence motif, comprises one or more sequences in one or both flanking regions of an
siRNA target sequence but does not comprise any siRNA target sequence. In one
embodiment, the functional motif comprises a contiguous stretch of about 10 nucleotides
flanking the 5' end of the target sequence. In another embodiment, the functional motif
comprises a contiguous stretch of about 10 nucleotides flanking the 3' end of the target
sequence. In a preferred embodiment, the functional motif comprises a contiguous stretch of
about 10 nucleotides flanking the 5' end of the target sequence and a contiguous stretch of
about 10 nucleotides flanking the 3' end of the target sequence. In one embodiment, the
functional motif comprises a contiguous stretch of about 10 nucleotides which starts about 35
nucleotides upstream of the 5' end of the target sequence. In another embodiment, the
functional motif comprises a contiguous stretch of about 10 nucleotides which starts about 25
nucleotides downstream of the 3' end of the target sequence. In still another embodiment, the
functional motif comprises a contiguous stretch of about 10 nucleotides which starts about 60
nucleotides downstream of the 3' end of the target sequence. In a preferred embodiment, the
functional motif comprises a contiguous stretch of about 10 nucleotides flanking the 5' end of
the target sequence, a contiguous stretch of about 10 nucleotides flanking the 3' end of the
target sequence, a contiguous stretch of about 10 nucleotides which starts about 35
nucleotides upstream of the 5' end of the target sequence, a contiguous stretch of about 10
nucleotides which starts about 25 nucleotides downstream of the 3' end of the target
sequence, and a contiguous stretch of about 10 nucleotides which starts about 60 nucleotides
downstream of the 3' end of the target sequence. Thus, a sequence motif can include a
contiguous stretch of D1 nucleotides flanking the 5' end of the target sequence, a contiguous
stretch of D2 nucleotides flanking the 3' end of the target sequence, a contiguous stretch of D3
nucleotides which starts about 35 nucleotides upstream of the 5' end of the target sequence, a
contiguous stretch of D4 nucleotides which starts about 25 nucleotides downstream of the 3'
end of the target sequence, and a contiguous stretch of D5 nucleotides which starts about 60
nucleotides downstream of the 3' end of the target sequence, where D1, D2, D3, D4, and D5 are
appropriate integers. In some preferred embodiments, each of D1, D2, D3, D4, and D5 is at
least 5, 10, or 20 nucleotides in length. The length of the functional motif is L = D1 + D2 +
D3 + D4 + D5.

In one embodiment, the characteristics of a functional sequence motif are
characterized using the frequency of each of G, C, A, U(or T) observed at each position along
the sequence motif. In the disclosure, U(or T), or sometimes simply U(T), is used to indicate
nucleotide U or T. The set of frequencies forms a frequency matrix, in which each element
indicates the number of times that a given nucleotide has been observed at a given position.
A frequency matrix representing a sequence motif of length L is a 4 × L matrix {f_ij}, where i =
G, C, A, U(T); j = 1, 2, ..., L; where f_ij is the frequency of the ith nucleotide at the jth position.
A frequency matrix of a sequence motif can be derived or built from a set of N siRNA target
sequences that exhibit a desired quality, e.g., a chosen level of susceptibility or resistance to
siRNA silencing.

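$$ f_{ij} = \sum_{k=1}^{N} \delta_{ik}(j) \qquad (1) $$

where

$$ \delta_{ik}(j) = \begin{cases} 1, & \text{if nucleotide } i \text{ occurs at position } j \text{ of the } k\text{th sequence} \\ 0, & \text{otherwise} \end{cases} \qquad (2) $$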

In embodiments in which a functional sequence motif consists of an siRNA target sequence of M
nucleotides, a flanking sequence of D1 nucleotides at one side of the siRNA target sequence and
a flanking sequence of D2 nucleotides at the other side of the siRNA target sequence, L = M +
D1 + D2. In embodiments in which the functional motif consists of an siRNA target sequence of
M nucleotides, a contiguous stretch of D1 nucleotides flanking the 5' end of the target
sequence, a contiguous stretch of D2 nucleotides flanking the 3' end of the target sequence, a
contiguous stretch of D3 nucleotides which starts about 35 nucleotides upstream of the 5' end
of the target sequence, a contiguous stretch of D4 nucleotides which starts about 25
nucleotides downstream of the 3' end of the target sequence, and a contiguous stretch of D5
nucleotides which starts about 60 nucleotides downstream of the 3' end of the target
sequence, L = M + D1 + D2 + D3 + D4 + D5.

In another embodiment, the characteristics of a functional sequence motif are
characterized using a set of weights, one for each nucleotide occurring at a position in the
motif. In such an embodiment, a weight matrix {e_ij}, where i = G, C, A, U(T); j = 1, 2, ...,
L, can be used for representing a functional sequence motif of length L, where e_ij is the weight
of finding the ith nucleotide at the jth position. In one embodiment, the weight e_ij is the
probability of finding the ith nucleotide at the jth position in the functional sequence motif.
When a probability is used for the weight, the matrix is also called a probability matrix. A
probability matrix of a sequence motif can be derived from a frequency matrix according to
the equation

$$ e_{ij} = \frac{f_{ij}}{N} \qquad (3) $$

In a preferred embodiment, a position-specific score matrix is used to characterize a
functional sequence motif. The PSSM can be constructed using log likelihood values
log(e_ij/p_ij), where e_ij is the weight of finding nucleotide i at position j, and p_ij is the weight of
finding nucleotide i at position j in a random sequence. In some embodiments, the
probability of finding the ith nucleotide at the jth position in the functional sequence motif is
used as e_ij, and the probability of finding nucleotide i at position j in a random sequence is used
as p_ij. The weight or probability p_ij is an "a priori" weight or probability. In some
embodiments, p_ij = 0.25 for each possible nucleotide i ∈ {G, C, A, U(T)} at each position j.
Thus, for a given sequence of length L, the sum of log likelihood ratios at all positions can be
used as a score for evaluating if the given sequence is more or less likely to match the
functional motif than to match a random sequence:

$$ \mathrm{Score} = \sum_{j=1}^{L} \ln(e_j / p_j) \qquad (4) $$

wherein e_j and p_j are respectively the weights of the nucleotide at position j in the functional
sequence motif and in a random sequence. For example, if such a score is zero, the sequence
has the same probability of matching the sequence motif as of matching a random sequence.
A sequence is more likely to match the sequence motif if the score is greater than zero.
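
By way of non-limiting illustration, the frequency matrix, probability matrix and log likelihood score described above can be sketched in Python as follows. The function names, the toy training sequences and the pseudocount are assumptions made for this example only; in particular, the pseudocount is not part of the disclosure and is added solely so that the logarithm stays finite when a base was never observed at a position.

```python
import math

BASES = "GCAU"  # row order i = G, C, A, U(T)

def frequency_matrix(seqs):
    """Count f_ij: how often base i occurs at position j across aligned sequences."""
    length = len(seqs[0])
    freq = {b: [0] * length for b in BASES}
    for seq in seqs:
        for j, base in enumerate(seq.upper().replace("T", "U")):
            freq[base][j] += 1
    return freq

def probability_matrix(freq, n_seqs, pseudocount=0.5):
    """Convert counts to probabilities e_ij.

    The pseudocount is an assumption of this sketch; with pseudocount=0
    this reduces to f_ij / N.
    """
    denom = n_seqs + 4 * pseudocount
    return {b: [(c + pseudocount) / denom for c in row] for b, row in freq.items()}

def log_odds_score(seq, prob, background=0.25):
    """Sum of ln(e_j / p_j) over all positions of seq."""
    seq = seq.upper().replace("T", "U")
    return sum(math.log(prob[base][j] / background) for j, base in enumerate(seq))

# Toy aligned motifs (hypothetical, not data from the disclosure).
training = ["GCAU", "GCAA", "GGAU", "GCUU"]
freq = frequency_matrix(training)
prob = probability_matrix(freq, len(training))
motif_like = log_odds_score("GCAU", prob)   # resembles the training set
unlike = log_odds_score("AAGC", prob)       # does not resemble it
```

In this toy case the sequence resembling the training set receives a positive score and the dissimilar sequence a negative one, which is the behaviour the score is intended to capture.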

In another embodiment, when two or more different nucleotides are not to be
distinguished, a PSSM with a reduced dimension can be used. For example, if the relative
base compositions of G and C in a sequence motif are not to be distinguished, a PSSM can be
a 3 × L matrix {log(E_ij/p_ij)}, where i = G/C, A, U(T); j = 1, 2, ..., L; where E_ij is the weight,
e.g., probability, of finding nucleotide i at position j, and p_ij is the weight, e.g., probability, of
finding nucleotide i at position j in a random sequence. Thus, in such cases, a PSSM has 3
sets of weights: GC-specific, A-specific and U-specific, e.g., if the base at a position is a G or
a C, the natural logarithm of the ratio of the GC weight and the unbiased probability of
finding a G or C at that position is used as the GC-specific weight for the position; and the
natural logarithms of the position-specific A and U(T) weights divided by the unbiased
probability of the respective base are used as the A- and U(T)-specific weights for the position,
respectively. The log likelihood ratio score is represented by Eq. (5):

$$ \mathrm{Score} = \sum_{j=1}^{L} \ln(E_j / p_j) \qquad (5) $$

where E_j is the weight assigned to a base (A, U or G/C) at position j, and p_j = 0.25 for A
or U and 0.5 for G/C.
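
As a hedged illustration of the reduced-dimension score of Eq. (5), the following Python sketch treats G and C as a single class with an unbiased probability of 0.5. The names `base_class` and `reduced_score` and the example weight matrix are hypothetical and chosen only for this example.

```python
import math

# Unbiased probabilities: 0.5 for G/C combined, 0.25 for A or U.
BACKGROUND = {"GC": 0.5, "A": 0.25, "U": 0.25}

def base_class(base):
    """Map a nucleotide to its row in the 3 x L matrix (G and C share one row)."""
    base = base.upper().replace("T", "U")
    return "GC" if base in "GC" else base

def reduced_score(seq, weights):
    """Sum over positions of ln(E_j / p_j) for the class found at position j."""
    total = 0.0
    for j, base in enumerate(seq):
        cls = base_class(base)
        total += math.log(weights[cls][j] / BACKGROUND[cls])
    return total

# Hypothetical 3 x 4 weight matrix favouring G/C at the first two positions.
weights = {
    "GC": [0.55, 0.55, 0.45, 0.45],
    "A":  [0.20, 0.20, 0.30, 0.30],
    "U":  [0.25, 0.25, 0.25, 0.25],
}
# A matrix equal to the background everywhere scores every sequence as zero.
neutral = {cls: [p] * 4 for cls, p in BACKGROUND.items()}
```

A matrix whose weights equal the unbiased probabilities gives a score of zero for any sequence, matching the interpretation of a zero score given above.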

In still another embodiment, when the relative base compositions of G and C in a
sequence motif are not to be distinguished and the relative base compositions of A and T in
the sequence motif are also not to be distinguished, a PSSM can be a 1 × L matrix
{log(E_ij/p_ij)}, where i = G/C; j = 1, 2, ..., L; where E_ij is the weight, e.g., probability, of
finding nucleotide i at position j, and p_ij is the weight, e.g., probability, of finding nucleotide i
at position j in a random sequence. Thus, in such cases, a PSSM has 1 set of GC-specific
weights: if the base at a position is a G or a C, the natural logarithm of the ratio of the GC
weight and the unbiased probability of finding a G or C at that position is used as the
GC-specific weight for the position. The log likelihood ratio score is represented by Eq. (5),
except that E_j is the weight assigned to a base (G/C) at position j, and p_j = 0.50.

5.1.2. METHODS OF DETERMINING A PROFILE

The invention provides methods of determining a PSSM of a functional sequence
motif based on a plurality of siRNAs for which some quantity or quantities characterizing the
siRNAs have been determined. For example, a plurality of siRNAs whose silencing efficacy
has been determined can be used for determination of a PSSM of an siRNA susceptible or
resistant sequence motif. In the disclosure, for simplicity, efficacy is often used as a
measure for classifying siRNAs. Efficacy of an siRNA is measured in the absence of other
siRNAs designed to silence the target gene. It will be apparent to one skilled in the art
that the methods of the invention are equally applicable in cases where siRNAs are classified
based on another measure. Such a plurality of siRNAs is also referred to as a library of
siRNAs. In cases where the functional sequence motif of interest comprises one or more
sequences in one or both flanking regions, a plurality of siRNA functional motifs, i.e., a
sequence comprising the siRNA target sequence and the sequences in the flanking region(s)
in a transcript, can be used to determine the PSSM of the functional motif. In a preferred
embodiment, the siRNA functional sequence motif consists of an siRNA target sequence of
19 nucleotides and a flanking sequence of 10 nucleotides at either side of the siRNA target
sequence. For simplicity, in this disclosure, unless specified, the term "a library of
siRNAs" is often used to refer to both a library of siRNAs and a library of siRNA
functional motifs. It will be understood that in the latter cases, when the efficacy of an
siRNA is referred to, it refers to the efficacy of the siRNA that targets the motif. Preferably,
the plurality of siRNAs or siRNA target motifs comprises at least 10, 50, 100, 200, 500,
1000, or 10,000 different siRNAs or siRNA target motifs.

Each different siRNA in the plurality or library of siRNAs or siRNA functional motifs 
can have a different level of efficacy. In one embodiment, the plurality or library of siRNAs 
consists of siRNAs having a chosen level of efficacy. In another embodiment, the plurality
or library of siRNAs comprises siRNAs having different levels of efficacy. In such an 
embodiment, siRNAs may be grouped into subsets, each consisting of siRNAs that have a 
chosen level of efficacy. 

In one embodiment, a PSSM of an siRNA functional motif is determined using a
plurality of siRNAs having a given efficacy. In one embodiment, a plurality of N siRNAs
consisting of siRNAs having a silencing efficacy above a chosen threshold is used to
determine a PSSM of an siRNA susceptible motif. The PSSM is determined based on the
frequency with which a nucleotide appears at a position (see Section 5.1.1). The chosen threshold
can be 50%, 75%, 80% or 90%. In another embodiment, a plurality of N siRNAs consisting
of siRNAs having a silencing efficacy below a chosen threshold is used to determine a PSSM
of an siRNA resistant motif. The chosen threshold can be 5%, 10%, 20%, 50%, 75% or
90%. In a preferred embodiment, the PSSM has a reduced dimension with a weight for G/C.

In preferred embodiments, a PSSM of a susceptible or resistant motif is derived or
built using a classifier approach with a set of N sequences. In such embodiments, a library of
siRNAs comprising siRNAs having different levels of efficacy is used. In one embodiment,
siRNAs in the library may be randomly grouped into subsets, each consisting of siRNAs that
have different levels of efficacy; one subset is used as a training set for determining a PSSM
and the other is used as a test set for validating the PSSM. Different criteria can be used
to divide the existing siRNA library into training and test sets. For an siRNA library in which
a majority of siRNA oligos are designed with the standard method, which requires an AA
dimer immediately before the 19mer oligo sequence, several partitions were used and more
than one trained PSSM (rather than a single PSSM) were combined to assign scores to the
test oligos. An exemplary siRNA library and divisions of the library into training and test
sets are shown in Table II.

In a preferred embodiment, the sequence motif consists of 39 bases in the transcript 
sequence, beginning 10 bases upstream of the 19mer siRNA target sequence and ending 10 
bases downstream of the 19mer. The PSSM characterizing such a sequence motif is
described in Section 5.1.1. 
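
For illustration only, extraction of such a 39-base motif from a transcript can be sketched as below. The function name `extract_motif`, the 0-based indexing convention and the example transcript are assumptions of this sketch and are not taken from the disclosure.

```python
def extract_motif(transcript, target_start, target_len=19, flank=10):
    """Return flank + target + flank bases, or None if the window leaves the transcript.

    target_start is the 0-based index of the first base of the siRNA target
    sequence within the transcript (an assumed convention).
    """
    start = target_start - flank
    end = target_start + target_len + flank
    if start < 0 or end > len(transcript):
        return None
    return transcript[start:end]

# Hypothetical transcript: 10 upstream bases, a 19mer target, 15 downstream bases.
example_transcript = "A" * 10 + "G" * 19 + "C" * 10 + "U" * 5
motif = extract_motif(example_transcript, 10)
```

Target sites closer than 10 bases to either end of the transcript return None in this sketch; how such sites are handled in practice is not specified here.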

In a preferred embodiment, the PSSM is determined by an iterative process. A PSSM
is initialized with random weights {e_ij} or {E_ij} within a given search range for all bases at all
positions. In another preferred embodiment, the PSSM is initialized to the smoothed mean base
composition difference between good and bad siRNAs in the training set. As an example, a
PSSM describing a 39 nucleotide sequence motif can have 117 elements. In another
embodiment, the weights are optimized by comparing the correlation of scores generated to a
quantity of interest, e.g., silencing efficacy, and selecting the PSSM whose scores best
correspond to that quantity. Improvement in PSSM performance is scored by comparing
correlation values before and after a change in weights at any one position. In one
embodiment, there is no minimum requirement for a change in correlation. Aggregate
improvement is calculated as the difference between the final correlation and the initial
correlation. In one embodiment, for a PSSM characterizing a 39mer sequence motif, the
aggregate improvement threshold after 117 cycles for termination of optimization is a
difference of 0.01.

In one embodiment, the weights are optimized to reflect base composition differences
between good siRNAs, i.e., siRNAs having at least median efficacy, and bad siRNAs, i.e.,
siRNAs having below median efficacy, in the range of allowed values for weights. If the
PSSM is initialized with a frequency matrix, the range of allowed values corresponds to the
frequency matrix elements +/- 0.05. If an unbiased search is used, the ranges of the allowed
values for weights are 0.45-0.55 for G/C and 0.2-0.3 for A or U. In one embodiment, weights
are allowed to vary from initial values by +/- 0.05. If an unbiased search is used, the PSSM
weights can be set to random initial values within the unbiased search range described above.

In one embodiment, the PSSM is determined by a random hill-climbing mutation
optimization procedure. In each step of the process, one base at one position is randomly
selected for optimization. For example, for a PSSM describing a 39 nucleotide sequence
motif, the 39 bases become a vector of 117 weights: 39 G/C weights, 39 A weights and 39 U
weights. One of these 117 weights is selected for optimization in each step, and is run
through all values in the search range at that step. For each value in the search range, scores
for a training set of siRNAs are calculated. The correlation of these scores with the silencing
efficacy of the siRNAs is then calculated. The weight for that position which generates the
best correlation between the scores and silencing efficacy is retained as the new weight at that
position.
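
The random hill-climbing procedure described above can be sketched, under stated assumptions, as follows. The grid of 11 values per search range, the fixed number of steps, the random seed and the toy training data are choices made for this example only; the disclosure does not specify them. The score is the log likelihood ratio score with G and C sharing one weight.

```python
import math
import random

BACKGROUND = {"GC": 0.5, "A": 0.25, "U": 0.25}
SEARCH_RANGE = {"GC": (0.45, 0.55), "A": (0.20, 0.30), "U": (0.20, 0.30)}

def base_class(base):
    """G and C share one weight row; A and U keep their own."""
    return "GC" if base in "GC" else base

def score(seq, weights):
    """Log likelihood ratio score: sum of ln(E_j / p_j)."""
    return sum(math.log(weights[base_class(b)][j] / BACKGROUND[base_class(b)])
               for j, b in enumerate(seq))

def pearson(xs, ys):
    """Pearson correlation; 0.0 if either list has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def hill_climb(seqs, efficacy, n_steps=200, grid_points=11, seed=0):
    rng = random.Random(seed)
    length = len(seqs[0])
    # Random initial weights inside the unbiased search range.
    weights = {c: [rng.uniform(*SEARCH_RANGE[c]) for _ in range(length)]
               for c in SEARCH_RANGE}

    def correlation():
        return pearson([score(s, weights) for s in seqs], efficacy)

    initial = correlation()
    for _ in range(n_steps):
        # Select one weight (one base class at one position) at random.
        c = rng.choice(list(SEARCH_RANGE))
        j = rng.randrange(length)
        lo, hi = SEARCH_RANGE[c]
        best_value, best_corr = weights[c][j], correlation()
        # Run the selected weight through all grid values in its search range.
        for k in range(grid_points):
            weights[c][j] = lo + (hi - lo) * k / (grid_points - 1)
            corr = correlation()
            if corr > best_corr:
                best_value, best_corr = weights[c][j], corr
        weights[c][j] = best_value  # retain the value giving the best correlation
    return weights, initial, correlation()

# Toy training set (hypothetical sequences and efficacies, not data from the disclosure).
seqs = ["GAUA", "CAUU", "AAUG", "UAUC", "GUAA", "AUAG"]
efficacy = [0.90, 0.85, 0.30, 0.20, 0.80, 0.25]
weights, initial_corr, final_corr = hill_climb(seqs, efficacy)
```

Because a weight is changed only when the correlation strictly improves, the correlation never decreases from one step to the next. A termination test on aggregate improvement, as described in this section, could replace the fixed step count used here.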

In one embodiment, the metric used to measure the effectiveness of the training and
testing is the aggregate false detection rate (FDR) based on the ROC curve, and is computed
as the average of the FDR scores of the top 33% of oligos sorted by the scores given by the
trained PSSM. In computing the FDR scores, those oligos with silencing levels less than the
median are considered false, and those with silencing levels higher than the median level are
considered true. The "false detection rate" is the number of false positives selected divided
by the total number of true positives, measured at each ranked position in a list. The false
detection rate can be a function of the fraction of all siRNAs selected. In one embodiment,
the area under the curve at 33% of the list selected is used as a single number representing
performance. In one embodiment, all at-least-median siRNAs are called "positives" and
all worse-than-median siRNAs are called "negatives." Thus, half the data are positives and
the other half are "false positives." In an ideal ranking, the area under the curve at 33% or
even at 50% of the list selected should be 0. In contrast, a random ranking would cause equal
numbers of true positives and false positives to be selected. This corresponds to an area
under the curve of 0.17 at 33% of the list selected, or 0.25 at 50% of the list selected.
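
As a non-limiting sketch, the aggregate false detection rate can be computed as the average, over the top-ranked fraction of the list, of the number of false positives selected so far divided by the total number of true positives. The function name `aggregate_fdr`, the rounding of the cutoff and the synthetic rankings below are assumptions of this example.

```python
def aggregate_fdr(scores, efficacies, top_fraction=0.33):
    """Average FDR over the top-ranked fraction of siRNAs.

    siRNAs with at least median efficacy count as true, the rest as false.
    """
    ordered = sorted(efficacies)
    n = len(ordered)
    median = (ordered[(n - 1) // 2] + ordered[n // 2]) / 2
    # Rank siRNAs from highest to lowest PSSM score.
    ranked = [e for _, e in sorted(zip(scores, efficacies), key=lambda p: -p[0])]
    total_true = sum(1 for e in ranked if e >= median)
    cutoff = max(1, round(top_fraction * n))
    false_seen, fdr_values = 0, []
    for e in ranked[:cutoff]:
        if e < median:
            false_seen += 1
        fdr_values.append(false_seen / total_true)
    return sum(fdr_values) / len(fdr_values)

# Synthetic example: 100 siRNAs with efficacies 0.01 ... 1.00.
trues = [i / 100 for i in range(51, 101)]
falses = [i / 100 for i in range(1, 51)]
# Ideal ranking: the score orders the siRNAs exactly by efficacy.
ideal = aggregate_fdr(trues + falses, trues + falses)
# Stand-in for a random ranking: true and false siRNAs strictly alternate.
interleaved = [e for pair in zip(trues, falses) for e in pair]
alternating_33 = aggregate_fdr(list(range(100, 0, -1)), interleaved, 0.33)
alternating_50 = aggregate_fdr(list(range(100, 0, -1)), interleaved, 0.50)
```

With these synthetic inputs the ideal ranking gives 0, and the alternating ranking gives about 0.165 at 33% and 0.25 at 50% of the list, in line with the values of 0.17 and 0.25 stated above for a random ranking.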

Correlations between % silencing and PSSM score are calculated according to methods
known in the art (see, e.g., Applied Multivariate Statistical Analysis, 4th ed., R.A. Johnson &
D.W. Wichern, Prentice-Hall, 1998).

The process is continued until the aggregate improvement over a plurality of iterations
falls below a threshold.

In a preferred embodiment, a plurality of PSSMs are obtained for a functional
sequence motif using an siRNA training set. In this disclosure, a plurality of PSSMs is also
referred to as an "ensemble" of PSSMs. Each round of optimization may stop at a local
optimum distinct from the global optimum. The particular local optimum reached is
dependent on the history of random positions selected for optimization. A higher
improvement threshold may not bring a PSSM optimized to a local optimum closer to the
global optimum. Thus it is more effective to run multiple optimizations than one long
optimization. Additional runs (e.g., up to 200) were found to enhance performance. Running
more than 200 optimizations was not seen to provide further enhancements in performance.
Empirically, scoring siRNAs via the average of multiple runs is less effective than scoring
candidate siRNAs on the PSSMs generated by each run and then summing the scores. Thus,
in one embodiment, the plurality of PSSMs are used individually or summed to generate a
composite score for each sequence match. The plurality of matrices can be tested
individually or as a composite on an independent set of siRNA target motifs with known
silencing efficacy to evaluate the utility for identifying sequence motifs and in siRNA design.
In a preferred embodiment, the plurality of PSSMs consists of at least 2, 10, 50, 100, 200, or
500 PSSMs.
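
A minimal sketch of composite scoring with an ensemble is given below, assuming each PSSM in the ensemble is a reduced-dimension weight matrix scored by the log likelihood ratio. The two example matrices `run_a` and `run_b` are hypothetical and stand in for the outputs of two independent optimization runs.

```python
import math

BACKGROUND = {"GC": 0.5, "A": 0.25, "U": 0.25}

def base_class(base):
    """G and C share one weight row; A and U keep their own."""
    return "GC" if base in "GC" else base

def pssm_score(seq, weights):
    """Log likelihood ratio score of seq on a single PSSM."""
    return sum(math.log(weights[base_class(b)][j] / BACKGROUND[base_class(b)])
               for j, b in enumerate(seq))

def ensemble_score(seq, ensemble):
    """Composite score: score seq on each PSSM separately, then sum the scores."""
    return sum(pssm_score(seq, weights) for weights in ensemble)

# Hypothetical results of two optimization runs on a 2-base motif.
run_a = {"GC": [0.55, 0.45], "A": [0.20, 0.30], "U": [0.25, 0.25]}
run_b = {"GC": [0.53, 0.47], "A": [0.22, 0.28], "U": [0.25, 0.25]}
ensemble = [run_a, run_b]
```

Each sequence is scored on every matrix separately and the scores are then summed, which is the approach described above as empirically more effective than averaging over runs.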

In a preferred embodiment, one or more different siRNA training sets are used to
obtain one or more ensembles of PSSMs. These different ensembles of PSSMs may be used
together in determining the score of a sequence motif.

Sequence weighting methods have been used in the art to reduce redundancy and
emphasize diversity in multiple sequence alignment and searching applications. Each of
these methods is based on a notion of distance between a sequence and an ancestral or
generalized sequence. Here a different approach is presented, in which base weights are based
on the diversity observed at each position in the alignment and the correlation between the base
composition and the observed efficacy of the siRNAs, rather than on a sequence distance
measure.

In still another embodiment, PSSMs are generated by a method which hypothesizes a
dependency of the base composition of any one position on its neighboring positions, referred
to as "curve models."

In one embodiment, curve models are generated as a sum of normal curves (i.e., 
Gaussian). It will be apparent to one skilled in the art that other suitable curve
functions, e.g., polynomials, can also be used. Each curve represents the probability of
finding a particular base in a particular region. The value at each position in the summed
normal curves is the weight given to that position for the base represented by the curve. The
weights for each base present at each position in each siRNA and its flanking sequences are
then summed to generate an siRNA's score, i.e., the score is Σ w_j. The score calculation can
also be described as the dot product of the base content in the sequence with the weights in 
the curve model. As such, it is one way of representing the correlation of the sequence of 
interest with the model. 

Curve models can be initialized to correspond to the major peaks and valleys present 
in the smoothed base composition difference between good and bad siRNAs, e.g., as 
described in FIGS. 1A-C and 5A-C. In one embodiment, curve models for G/C, A and U are
obtained. In one embodiment, the initial model can be set up for the 3-peak G/C curve model 
as follows: 



Peak 1
mean: 1.5
standard deviation: 2
amplitude: 0.0455

Peak 1 mean, standard deviation and amplitude are set to correspond to the peak in the mean
difference in GC content between good and bad siRNAs occurring within bases -2 to 5 of the
siRNA target site in Set 1 training and test sets.

Peak 2
mean: 11
standard deviation: 0.5
amplitude: 0.0337

Peak 2 mean, standard deviation and amplitude are set to correspond to the peak in the mean
difference in GC content between good and bad siRNAs occurring within bases 10-12 of the
siRNA target site in Set 1 training and test sets.

Peak 3
mean: 18.5
standard deviation: 4
amplitude: -0.0548

Peak 3 mean, standard deviation and amplitude are set to correspond to the peak in the mean
difference in GC content between good and bad siRNAs occurring within bases 12-25 of the
siRNA target site in Set 1 training and test sets.
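
By way of illustration, the initial 3-peak G/C curve model listed above can be evaluated as a sum of Gaussian-shaped peaks, with the amplitude taken as the peak height. The function names and the position convention (the coordinate of the first motif base is passed in explicitly, relative to the siRNA target site) are assumptions of this sketch.

```python
import math

# (mean, standard deviation, amplitude) of the three initial G/C peaks.
GC_PEAKS = [(1.5, 2.0, 0.0455), (11.0, 0.5, 0.0337), (18.5, 4.0, -0.0548)]

def curve_weight(position, peaks):
    """Weight at one position: the sum of all peaks evaluated at that position."""
    return sum(amp * math.exp(-((position - mean) ** 2) / (2 * sd ** 2))
               for mean, sd, amp in peaks)

def gc_curve_score(motif, first_position, peaks=GC_PEAKS):
    """Sum the G/C curve weight over every position where the motif has G or C.

    first_position is the coordinate of motif[0] relative to the siRNA
    target site (an assumed convention for this example).
    """
    return sum(curve_weight(first_position + k, peaks)
               for k, base in enumerate(motif.upper()) if base in "GC")

near_start = gc_curve_score("GC", 1)    # G/C at positions 1 and 2
near_end = gc_curve_score("GC", 18)     # G/C at positions 18 and 19
```

With these initial parameters, G/C content near the start of the target site contributes positively to the score and G/C content near the end contributes negatively.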

Peak height (amplitude), center position in the sequence (mean) and width (standard
deviation) of a peak in a curve model can be adjusted. Curve models are optimized by
adjusting the amplitude, mean and standard deviation of each peak over a preset grid of
values. In one embodiment, curve models are optimized on several training sets and tested
on several test sets, e.g., training sets and test sets as described in Table II. Each base - G/C,
A and U(or T) - is optimized separately, and then combinations of optimized models are
screened for best performance.

Preferably, optimization criteria for curve models are: (1) the fraction of good oligos
in the top 10%, 15%, 20% and 33% of the scores, (2) the false detection rate at 33% and 50%
of the siRNAs selected, and (3) the correlation coefficient of siRNA silencing vs. siRNA
scores used as a tiebreaker.

When the model is trained, a grid of possible values for amplitude, mean and standard 
deviation of each peak is explored. The models with the top value or within the top range of
values for any of the above criteria were selected and examined further. 

In a preferred embodiment, G/C models are optimized with 3 or 4 peaks, A models 
are optimized with 3 peaks, and U models are optimized with 5 peaks. Exemplary ranges of 
parameters optimized for curve models are shown in Example 3, infra.

Preferably, the performance of the obtained PSSM is evaluated. In one embodiment, 
the PSSM is evaluated using an ROC (receiver operating characteristic) curve. An ROC 
curve is a plot of the sensitivity of a diagnostic test as a function of non-specificity. An ROC 
curve indicates the intrinsic properties of a test's diagnostic performance and can be used to 
compare relative merits of competing procedures. In one embodiment, the sensitivity of a 
PSSM is calculated as the proportion of true positives detected as a fraction of total true 
positives, whereas the non-specificity of the PSSM is calculated as the proportion of false
positives detected as a fraction of total false positives (see, e.g., G. Campbell, 1994, Statistics
in Medicine 13:499-508; Metz, 1986, Investigative Radiology 21:720-733; Gribskov et al., 
1996, Computers Chem. 20:25-33). FIG. 3 shows ROC curves of the two PSSMs selected 
for the current best practice of the invention. 
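
As a hedged sketch, the points of such an ROC curve can be obtained by lowering the score threshold one siRNA at a time and recording the sensitivity and non-specificity defined above. The function names and the trapezoidal area calculation are assumptions of this example; tied scores are not treated specially here.

```python
def roc_points(scores, is_positive):
    """Return (non-specificity, sensitivity) pairs as the score threshold is lowered."""
    ranked = sorted(zip(scores, is_positive), key=lambda p: -p[0])
    total_pos = sum(1 for flag in is_positive if flag)
    total_neg = len(is_positive) - total_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, positive in ranked:
        if positive:
            tp += 1   # one more true positive detected
        else:
            fp += 1   # one more false positive detected
        points.append((fp / total_neg, tp / total_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores for four siRNAs, two positives and two negatives.
perfect = roc_points([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
inverted = roc_points([0.9, 0.8, 0.2, 0.1], [False, False, True, True])
```

A ranking that places all positives first gives an area of 1, and a fully inverted ranking gives an area of 0.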

In another embodiment, the performance of a PSSM is evaluated by comparing a
plurality of sequence motifs identified using the PSSM with a plurality of reference sequence
motifs. The PSSM is used to obtain the plurality of sequence motifs by, e.g., scanning one or
more transcripts and identifying sequence motifs that match the PSSM, e.g., with a score
above a threshold. Preferably, the plurality comprises at least 3, 5, 10, 20 or 50 different
sequence motifs. The reference sequence motifs can be from any suitable source. In one
embodiment, a plurality of reference sequence motifs is obtained using a standard method
(e.g., Elbashir et al., 2001, Nature 411:494-8). The two pluralities are then compared using
any standard method known in the art to determine if they are identical.

In a preferred embodiment, the two pluralities are compared using a Wilcoxon rank
sum test. A Wilcoxon rank sum test tests if two pluralities of measurements are identical
(see, e.g., Snedecor and Cochran, Statistical Methods, Eighth Edition, 1989, Iowa State
University Press, pp. 142-144; McClave and Sincich, 2002, Statistics, Ninth Edition, Prentice
Hall, Chapter 14). The Wilcoxon rank sum test can be considered a non-parametric
equivalent of the unpaired t-test. It is used to test the hypothesis that two independent
samples have come from the same population. Because it is non-parametric, it makes only
limited assumptions about the distribution of the data. It assumes that the shape of the
distribution is similar in the two groups. This is of particular relevance if the test is to be
used as evidence that the median is significantly different between the groups.

The test ranks all the data from both groups. The smallest value is given a rank of 1,
the second smallest is given a rank of 2, and so on. Where values are tied, they are given an
average rank. The ranks for each group are added together (hence the term rank sum test).
The sums of the ranks are compared with tabulated critical values to generate a p value. In a
Wilcoxon rank sum test, p, a function of X, Y, and α, is the probability of observing a result
equal to or more extreme than the one obtained using the data (X and Y) if the null hypothesis
is true. The value of p indicates the significance for testing the null hypothesis that the
populations generating the two independent samples, X and Y, are identical. X and Y are
vectors but can have different lengths, i.e., the samples can have different numbers of elements.
The alternative hypothesis is that the median of the X population is shifted from the median of
the Y population by a non-zero amount. α is a given level of significance and is a scalar between
zero and one. In some embodiments, the default value of α is set to 0.05. If p is near zero, the
null hypothesis may be rejected.
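
The ranking step described above can be illustrated with the following Python sketch. Note that, instead of tabulated critical values, this example substitutes a normal approximation without tie correction to obtain a two-sided p value; that substitution, the function names and the sample data are assumptions of the sketch.

```python
import math

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of 1-based ranks start+1 .. end+1
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks

def rank_sum_test(x, y):
    """Return (rank sum of x, two-sided p) using a normal approximation."""
    ranks = average_ranks(list(x) + list(y))
    n1, n2 = len(x), len(y)
    w = sum(ranks[:n1])
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, p

w_separated, p_separated = rank_sum_test([1, 2, 3], [4, 5, 6])
w_same, p_same = rank_sum_test([1, 2, 3], [1, 2, 3])
```

For samples as small as those in this example, exact tables are generally preferable to the normal approximation; the sketch is intended only to make the ranking and summation steps concrete.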

In one embodiment, the PSSM approach of the present invention was compared to the
standard method (e.g., Elbashir et al., 2001, Nature 411:494-8) for its performance in
identifying siRNAs having high efficacy. The results obtained with three siRNAs selected by
each method are shown in Figure 3. siRNAs selected by the method using the PSSM showed
better median efficacy (88% as compared to 78% for the standard method siRNAs) and were
more uniform in their performance. The minimum efficacy was greatly improved (75% as
compared to 12% for the standard method). The distribution of silencing efficacies of
siRNAs designed using the algorithm based on PSSM was significantly better than that of the
siRNAs designed using the standard method for the same genes (p=0.004, Wilcoxon rank sum
test).

5.1.3. ALTERNATIVE METHOD FOR EVALUATING SILENCING EFFICACY OF siRNAS

Position-specific scoring matrix approaches are the preferred method of representing
siRNA functional motifs, e.g., siRNA susceptible and resistant motifs. However, the
information represented by PSSMs can also be represented by other methods which also
provide weights for base-composition at particular positions. This section provides such
methods for evaluating siRNA functional motifs.

5.1.3.1. METHODS BASED ON SEQUENCE WINDOWS

A common method of weighting base-composition at positions in a sequence is to
tally the number of a particular base or set of bases in a "window" of sequence positions.
Alternatively, the tally is represented as a percentage. The number of values of such a score,
referred to as a window score, depends on the size of the window. For example, scoring a
window of size 5 for G/C content may give values of 0, 1, 2, 3, 4 or 5; or 0%, 20%, 40%,
60%, 80% or 100%.
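
A window score of this kind can be sketched as follows. The function name `window_score` and its parameters are hypothetical and used only for illustration.

```python
def window_score(seq, start, size, bases="GC", as_percent=False):
    """Count how many bases in seq[start:start+size] belong to the given set.

    start is a 0-based index into seq. With as_percent=True the tally is
    returned as a percentage of the window size.
    """
    window = seq[start:start + size].upper()
    if len(window) < size:
        raise ValueError("window runs past the end of the sequence")
    count = sum(1 for b in window if b in bases)
    return 100.0 * count / size if as_percent else count

gc_count = window_score("GGCAU", 0, 5)                     # tally of G/C bases
gc_percent = window_score("GGCAU", 0, 5, as_percent=True)  # same tally as a percentage
```

For the 5-base window GGCAU the tally is 3, i.e. 60%, one of the six possible values listed above.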

An alternative method of scoring a window is to calculate the duplex melting 
temperature or ΔG for the bases in that window. These thermodynamic quantities reflect the
composition of all bases in the window as well as their particular order. It is readily apparent 
to one of skill in the art that these thermodynamic quantities directly depend on the base 
composition of each window, and are dominated by the G/C content of the window while 
showing some variation with the order of the bases. 

In one embodiment, the information represented by the base-composition differences,
e.g., in Figures 1A, 1B and 1C, is represented by windows of base-composition
corresponding to the positions of the peaks of increased or decreased composition of a
particular base(s). These windows can be scored for content of the particular base(s), with
increased or decreased base composition corresponding to sequences which are more or less
functional or resistant for siRNA targeting. For example, a 5-base window of increased G/C
content from base -1 to base 3 relative to the siRNA 19mer duplex, and a 16-base window of
decreased G/C content from base 14 to base 29 relative to the siRNA 19mer duplex, can be
used to represent some of the siRNA functional motif reflected in Figure 1A.

The scores may be used directly as a classifier: in the example of a 5-base window, a
5-part classifier is automatically available. Scores can also be compared to a calculated or
empirically derived threshold to use the window as a 2-part classifier. Windows can also be
used in combination. The scores of each sequence over multiple windows can be summed
with or without normalization or weighting. In one embodiment, scores for each window are
normalized by subtracting the mean score in a set of scores and then dividing by the standard
deviation in the set of scores. In another embodiment, scores are weighted by the Pearson
correlation coefficient obtained by comparing that window's score with the measured
efficacy of a set of siRNAs. In another embodiment, scores are normalized, and then
weighted before summation.
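
For illustration, normalizing each window's scores, weighting them by their Pearson correlation with measured efficacy and summing across windows can be sketched as below. The use of the population standard deviation, the function names and the toy data are assumptions of this example.

```python
import math

def zscores(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]

def pearson(xs, ys):
    """Pearson correlation; 0.0 if either list has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def combined_scores(window_scores, efficacy):
    """Normalize each window's scores, weight by correlation with efficacy, then sum.

    window_scores holds one list of per-siRNA scores for each window.
    """
    total = [0.0] * len(efficacy)
    for scores in window_scores:
        weight = pearson(scores, efficacy)
        for k, z in enumerate(zscores(scores)):
            total[k] += weight * z
    return total

# Toy data for four siRNAs (hypothetical): one window correlates positively
# with efficacy, the other negatively.
window_a = [5, 4, 1, 0]
window_b = [0, 1, 4, 5]
efficacy = [0.9, 0.8, 0.3, 0.2]
combined = combined_scores([window_a, window_b], efficacy)
```

Because the weight carries the sign of the correlation, a window whose score falls as efficacy rises still contributes in the right direction to the combined score.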

As an example of the use of windows to represent siRNA functional motifs, the
following list of parameters was considered for prediction of siRNA efficacy:

1. Straightforward parameters.

ATG_Dist - distance to the start codon.

STOP_Dist - distance to the end of the coding region.

Coding_Percent - ATG_Dist as a percentage of the length of the coding region.

End_Dist - distance to the end of the transcript.

Total_Percent - start position as a percentage of the length of the transcript sequence.

2. Window-based parameters.

119 bases on the transcript sequence were considered (19mer plus 50 bases
downstream and 50 bases upstream). Windows of sizes 3-10 were examined for each position
from the beginning to the end of the 119-base chunk. The following items were counted for
each window position:

a. Numbers of bases: A, C, G, or U.

b. Numbers of pairs of bases: M (A or C), R (A or G), W (A or U), S (C or G), Y (C
or U), and K (G or U).

c. Numbers of various ordered dimers: AC, AT, AG, MM, RY, KM, SW, etc.

d. The longest stretches of the above one-base or two-base units.

3. Motif-based parameters.

These parameters are also based on the 119-base chunks. The letters include the bases
(A, C, G, U) and pairs of bases (M, W, S, Y, K).

(1). Position-specific one-mers, dimers, or trimers.

(2). Numbers of 1mers to 7mers in four large regions: 50 bases upstream, 19mer
proper, 50 bases downstream, and the whole 119mer region.

4. Structural parameters. 

The structural parameters are based on the following regions.

the 19mer oligo proper (prefix: proper)

the 20mer immediately upstream of the oligo (prefix: up20)

the 40mer immediately upstream of the oligo

the 60mer immediately upstream of the oligo

the 20mer immediately downstream of the oligo (prefix: down20)

the 40mer immediately downstream of the oligo

the 60mer immediately downstream of the oligo

Base-pairing predicted by RNAStructure was examined and the following parameters
were calculated:

the count of bulge loops (parameter: bulge)

the total bases in the bulge loops (bulge_b)

the count of internal loops (internal)

the total bases in the internal loops (internal_b)

the count of hairpins (hairpin)

the total bases in the hairpins (hairpin_b)

the count of other motif regions (other)

the total bases in the other motif regions (other_b)

the total paired bases (total_pairs_b)

the total non-paired bases (total_nonpairs_b)

the longest stretch of paired bases (longest_pairs_b)

the longest stretch of non-paired bases (longest_nonpairs_b)

Thus, a total of 12*7=84 parameters were computed about the secondary structure
motifs for each siRNA.

5. Parameters on off-target predictions.

10 different parameters were computed using the weighted FASTA score discussed in
Section 5.2, the minimax score and the predicted duplex ΔG discussed in Section 5.4, using
different conditions.

Parameters were normalized and weighted by the Pearson correlation coefficient of
the scores with the silencing efficacy of the siRNAs examined. Various methods were used
to select the parameters with the greatest predictive power for siRNA efficacy; the various
methods agreed on the selection of 1750 parameters. 1190 of these are window-based base
composition parameters, 559 are motif-based base composition parameters, and only 1
structural parameter was selected. No other parameters were selected.
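
One plausible reading of "normalized and weighted by the Pearson correlation coefficient" is sketched below: each parameter is z-scored across the siRNAs and then multiplied by its correlation with silencing efficacy. The exact normalization used is not stated in the text, so the z-score step and all function names here are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def zscore(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [0.0 if sd == 0 else (v - mean) / sd for v in values]


def weight_parameters(params, efficacy):
    """params: dict of parameter name -> list of values, one per siRNA.

    Returns dict of name -> (r, weighted values), where r is the Pearson
    correlation of the raw parameter with efficacy.
    """
    weighted = {}
    for name, values in params.items():
        r = pearson(values, efficacy)
        weighted[name] = (r, [r * z for z in zscore(values)])
    return weighted
```

Parameters with |r| near zero contribute almost nothing after weighting, which is one simple way a selection step could rank them.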

5.1.3.2. SEQUENCE FAMILY SCORING METHODS 

Sequence consensus patterns, hidden Markov models and neural networks can also be
used to represent siRNA functional motifs, e.g., siRNA susceptible or resistant motifs, as an
alternative to PSSMs.

First, an siRNA functional motif, e.g., an siRNA susceptible or resistant motif, can be
understood as a loose consensus sequence for a family of distantly related sequences, e.g.,
the family of functional siRNA target sites. Scoring sequences for similarity to a family
consensus is well known in the art (Gribskov, M., McLachlan, A.D., and Eisenberg, D. 1987.
Profile analysis: detection of distantly related proteins. PNAS 84:4355-4358; Gribskov, M.,
Luthy, R., and Eisenberg, D. 1990. Profile analysis. Meth. Enzymol. 183:146-159). Such
scoring methods are most commonly referred to as "profiles", but may also be referred to as
"templates" or "flexible patterns" or similar terms. Such methods are more or less statistical
descriptions of the consensus of a multiple sequence alignment, using position-specific scores
for particular bases or amino acids as well as for insertions or deletions in the sequence.
Weights can be derived from the degree of conservation at each position. A difference
between consensus profiles and PSSMs as the term is used in this text is that spacing can be
flexible in consensus profiles: discontinuous portions of an siRNA functional motif, e.g., an
siRNA susceptible or resistant motif, can be found at varying distances to each other, with
insertions or deletions permitted and scored as bases are.

Profile hidden Markov models are statistical models which also represent the 
consensus of a family of sequences. Krogh and colleagues (Krogh, A., Brown, M., Mian, 
I.S., Sjolander, K. and Haussler, D. 1994. Hidden Markov models in computational biology: 
Applications to protein modeling. J. Mol. Biol. 235:1501-1531) applied HMM techniques to
modeling sequence profiles, adopting techniques from speech recognition studies (Rabiner,
L.R. 1989. A tutorial on hidden Markov models and selected applications to speech
recognition. Proc. IEEE 77:257-286). The use of hidden Markov models for analysis of
biological sequences is now well known in the art and applications for hidden Markov model
calculation are readily available; for example, the program HMMER 
(http://hmmer.wustl.edu). 

Profile hidden Markov models differ from consensus profiles as described above in 
that profile hidden Markov models have a formal probabilistic basis for setting the weights 
for each base, insertion or deletion at each position. Hidden Markov models can also perform 
the alignment of unknown sequences for discovery of motifs as well as determining
position-specific weights for said motifs, while consensus profiles are generally derived from
previously aligned sequences. 

Consensus profiles and profile hidden Markov models can assume that the base
composition at a particular position is independent of the base composition of all other
positions. This is similar to the random-hill-climbing PSSMs of this invention but distinct
from the windows and curve model PSSMs. 

To capture dependency of base composition at a particular position on the
composition of neighboring positions, Markov models can be used as fixed-order Markov
chains and interpolated Markov models. Salzberg and colleagues applied interpolated
Markov models to finding genes in microbial genomes as an improvement over fixed-order
Markov chains (Salzberg, S.L., Delcher, A.L., Kasif, S., and White, O. 1998. Nucl. Acids
Res. 26:544-548). A fixed-order Markov chain predicts each base of a sequence as a function
of a fixed number of bases preceding that position. The number of preceding bases used to
predict the next is known as the order of the Markov chain. Interpolated Markov models use
a flexible number of preceding bases to predict the base composition at a particular position.
This permits training on smaller sequence sets. Sufficient predictive data may be available
for n-mers of various lengths in a training set such that some predictions of succeeding bases
can be made, while insufficient data may be available for all oligomers at any fixed length.
Interpolated Markov models thus have more freedom to use preferable longer oligomers for
prediction than fixed-order Markov chains, when said long oligomers are sufficiently
frequent in the training set. Interpolated Markov models employ a weighted combination of
probabilities from a plurality of oligomer lengths for classification of each base.
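
The idea of blending predictions from several context lengths can be sketched as below. This is a toy illustration only: the weighting rule (each order weighted by how often its context was seen in training) and the pseudocount are assumptions chosen for simplicity, and are not the weighting scheme of Salzberg and colleagues.

```python
from collections import defaultdict

ALPHABET = "ACGU"


def train_counts(seqs, max_order=3):
    """counts[k][context][base] for context lengths k = 0..max_order."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for s in seqs:
        for i, base in enumerate(s):
            # Use every context length that fits before position i.
            for k in range(min(i, max_order) + 1):
                counts[k][s[i - k:i]][base] += 1
    return counts


def interpolated_prob(counts, context, base, pseudo=1.0):
    """P(base | context) as a weighted blend of order-0..order-k estimates."""
    num = den = 0.0
    for k in range(len(counts)):
        if k > len(context):
            break
        ctx = context[len(context) - k:]
        seen = counts[k].get(ctx, {})
        total = sum(seen.values())
        # Order-k estimate with a pseudocount so unseen bases keep some mass.
        p_k = (seen.get(base, 0) + pseudo) / (total + pseudo * len(ALPHABET))
        # Assumed weighting: contexts seen more often get more say; order 0
        # always gets a nonzero weight so the blend is defined.
        weight = total + 1.0 if k == 0 else float(total)
        num += weight * p_k
        den += weight
    return num / den
```

A fixed-order chain corresponds to using a single value of k; the blend lets rarely seen long contexts fall back on shorter ones.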

Fixed-order Markov chains and interpolated Markov models can represent siRNA
functional motifs, e.g., siRNA susceptible or resistant motifs, in terms of the dependency of
the base composition at a particular position on the composition of the preceding positions.
An interpolated Markov model building process will discover the oligomers most predictive 
of siRNA functional or nonfunctional motifs. 

Neural networks are also employed to score sequences for similarity to a family of 
sequences. A neural network is a statistical analysis tool used to build a model through an 
iterative learning process. The trained network will then perform a classification task, 
dependent upon the desired output and the training input initially associated with that output.
Typically a neural network program or computational device is supplied with a training set of
sequences and sets up a state representing those sequences. The neural network is then tested 
for performance on a test set of sequences. Neural networks can be used to predict and model 
siRNA functional motifs, e.g., siRNA susceptible and resistant motifs. A disadvantage of 
neural networks is that the actual sequence features of a motif can be difficult or impossible 
to determine from examination of the state of the trained network. 



5.1.4. METHODS OF IDENTIFYING SEQUENCE MOTIFS IN A GENE FOR
TARGETING BY AN SIRNA

The invention provides a method for identifying one or more sequence motifs in a
transcript which are siRNA-susceptible or -resistant motifs. The corresponding functional or
nonfunctional siRNAs are thereby also provided by the method. In one embodiment, the
sequence region of interest is scanned to identify sequences that match the profile of a
functional motif. In one embodiment, a plurality of possible siRNA sequence motifs
comprising siRNA sequence motifs tiled across the region at steps of a predetermined base
interval are evaluated to identify sequences that match the profile. In a preferred
embodiment, steps of 1, 5, 10, 15, or 19 base intervals are used. In a preferred embodiment,
the entire transcript sequence is scanned. A score is calculated for each different sequence
motif using a PSSM as described in Sections 5.1.1.-5.1.3. The sequences are then ranked
according to the score. One or more sequences are then selected from the ranked list. In one
embodiment, siRNA sequence motifs having the highest scores are selected as
siRNA-susceptible motifs. In another embodiment, siRNA sequence motifs having the lowest
scores are selected as siRNA-resistant motifs.
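
The scan-score-rank procedure can be sketched as follows. The sketch assumes a PSSM stored as one dictionary of per-base weights per position and scored by simple summation; the PSSMs of Sections 5.1.1.-5.1.3. may be scored differently, and the function names are hypothetical.

```python
def score_pssm(site, pssm):
    """Sum the position-specific weight of each base in the site."""
    return sum(pssm[i][base] for i, base in enumerate(site))


def rank_sites(transcript, pssm, step=1):
    """Tile windows across the transcript, score each, sort highest first.

    Returns a list of (score, start, site) tuples; start is 0-based.
    """
    width = len(pssm)
    scored = []
    for start in range(0, len(transcript) - width + 1, step):
        site = transcript[start:start + width]
        scored.append((score_pssm(site, pssm), start, site))
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Taking the head of the returned list gives candidate susceptible motifs and the tail gives candidate resistant motifs; `step` corresponds to the tiling interval (1, 5, 10, 15, or 19).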

The inventors have discovered that the correlation between silencing efficacy and the
base composition profiles of siRNA functional motifs may depend on one or more factors,
e.g., the abundance of the target transcript. For example, the inventors have found that for
silencing poorly-expressed genes, e.g., genes whose transcript levels are less than about 5
copies per cell, siRNA functional motifs having high GC content asymmetry at the two ends
of the target sequence and having high GC content in the sequence regions flanking the target
sequence have lower silencing efficiency than siRNA functional motifs having moderate GC 
content asymmetry at the two ends of the target sequence and low GC content in the flanking 
regions. The effect of target transcript abundance on silencing efficacy is illustrated in 
Example 6. 

While not to be confined by any theory, the inventors reason that the silencing
efficacy of a particular siRNA functional motif is a result of the interplay of a number of
processes, including RISC formation and siRNA duplex unwinding, diffusion of the RISC
and target mRNA, reaction of the RISC/target complex, which may include diffusion of the
RISC along the target mRNA, cleavage reaction, and products dissociation, etc. Thus, the
abundance of the transcript, the base composition profile of the siRNA, the base composition
profile of the target sequence and flanking sequences, and the concentration of the siRNA
and RISC in a cell may all affect silencing efficacy. Different processes may involve
different sequence regions of an siRNA or siRNA sequence motif, i.e., different sequence
regions of an siRNA or siRNA sequence motif may have different functions in transcript
recognition, cleavage, and product release. siRNAs may be designed based on criteria that
take one or more of such features into account. For example, bases near the 5' end of the
guide strand are implicated in transcript binding (both on- and off-target transcripts), and
have been shown to be sufficient for target RNA-binding energy. Weaker base pairing at the
5' end of the antisense strand (3' end of the duplex) encourages preferential interaction of the
antisense strand with RISC, e.g., by facilitating unwinding of the siRNA duplex by a 5'-3'
helicase component of RISC. A preference for U at position 10 of the sense strand of an
siRNA has been associated with improved cleavage efficiency by RISC, as it is in most
endonucleases. Low GC content sequence flanking the cleavage site may enhance
accessibility of the RISC/nuclease complex for cleavage, or release of the cleaved transcript,
consistent with recent studies demonstrating that base pairs formed by the central and 3'
regions of the siRNA guide strand provide a helical geometry required for catalysis. Thus,
the invention provides a method of identifying siRNA sequence motifs (and thus siRNAs) by
obtaining siRNAs that have an optimal sequence composition in one or more sequence
regions such that these siRNAs are optimal in one or more of the siRNA functional processes.
In one embodiment, the method comprises identifying siRNA sequence motifs whose overall
sequence and/or different sequence regions have desired composition profiles. The method
can be used to identify siRNA motifs that have a desired sequence composition in a particular
region, and thus are optimized for one functional process. The method can also be used to
identify siRNAs that have a desired sequence composition in a number of regions, and thus are
optimized for a number of functional processes.

In a preferred embodiment, a single siRNA functional profile, e.g., a profile as
represented by a set of PSSMs, is obtained, e.g., by training with silencing efficacy data of a
plurality of siRNAs that target genes having different transcript abundances using a method
described in Section 5.1.2 or Section 5.1.3., and is used to evaluate siRNA sequence motifs in
gene transcripts having abundances in all ranges. In one embodiment, the siRNA sequence
motifs in gene transcripts having abundances in any range are evaluated based on the degree
of similarity of their sequence base composition profiles to the profile or profiles represented
by the set of PSSMs. In one embodiment, the PSSM scores of siRNA functional motifs for a
gene of interest are obtained by a method described in Section 5.1.1. A predetermined
reference value or reference range of values of the PSSM score is determined based on
siRNAs that target genes having expression levels in different ranges. Methods for
determining the reference value or range of reference values are described below. siRNA
functional motifs in a particular gene are then ranked based on the closeness of their scores to
the predetermined reference value or within the reference range. One or more siRNAs
having scores closest to the predetermined value or within the reference range are then
selected. In another embodiment, a predetermined reference value of the PSSM score or a
reference range of the PSSM scores is used for genes having expression levels in a given
range. The reference value or the reference range is determined based on siRNAs that target
genes having expression levels in the range. siRNA functional motifs in a particular gene are
then ranked based on the closeness of their scores to the predetermined reference value or
within the reference range. One or more siRNAs having scores closest to the predetermined
value or within the reference range are then selected.

The reference value or the reference range can be determined in various ways. In a 
preferred embodiment, correlation of PSSM scores of a plurality of siRNAs having one or
more features, e.g., having particular efficiency in one or more siRNA functional processes,
with silencing efficacy is evaluated. In a preferred embodiment, the feature is that the
plurality of siRNAs targets poorly-expressed genes. The value of the score corresponding to
maximum median silencing is used as the reference value. In a specific embodiment, the
reference value is 0. One or more siRNAs having PSSM scores the closest to the reference
score are selected.

In another embodiment, the range of scores corresponding to siRNAs having a given 
level of silencing efficacy, e.g., efficacy above 75%, is used as the range for the reference 
values. In one embodiment, effective siRNAs are found to have scores between -300 and
+200 as long as the GC content in bases 2-7 is controlled. In a specific embodiment, a
reference value of between -300 and +200 is used. One or more siRNAs having PSSM
scores within the range are selected.

In another preferred embodiment, a particular score range within the range of PSSM 
scores of the plurality of siRNAs having one or more features, e.g., having particular 
efficiency in one or more siRNA functional processes, is used as the range of the reference 
value. In a preferred embodiment, the feature is that the plurality of siRNAs targets
poorly-expressed genes. In one embodiment, a certain percentile in the range of PSSM scores is used
as the range of the reference value, e.g., 90%, 80%, 70%, or 60%. In a specific embodiment,
the combined PSSM score range in the training set has a maximum of 200, with 97% of the
scores being 0 or below and 60% of the scores being below -300.

In still another preferred embodiment, a sum of scores from a plurality of sets of
PSSMs (see Section 5.1.2) is used as the reference score. In a specific embodiment, the
plurality of sets consists of the two sets of PSSMs described previously. The two sets of
PSSMs differ in the base composition preferred for siRNAs, in particular with respect to the 
GC content of the 19mer and flanking sequences. With a combined score of 0, the PSSM 
sets are in balance in their preference for the siRNA. 

In another preferred embodiment, in addition to the PSSM scores, the siRNA
sequence motifs are also ranked according to GC content at positions corresponding to
positions 2-7 of the corresponding siRNAs, and one or more siRNA sequence motifs that
have a GC content of approximately 0.15 to 0.5 (corresponding to 1-3 G or C) in the region are
selected.

In still another preferred embodiment, siRNA sequence motifs having a G or C at the
position corresponding to position 1 of the corresponding 19mer siRNA and an A or T at the
position corresponding to position 19 of the corresponding 19mer siRNA are selected. In still
another preferred embodiment, siRNA motifs in which 200 bases on either side of the 19mer
target region are not repeat or low-complexity sequences are selected.

In a specific embodiment, the siRNA sequence motifs are selected in the following
manner: (1) they are first ranked according to GC content at positions corresponding to
positions 2-7 of the corresponding siRNAs, and one or more siRNA sequence motifs that
have a GC content of approximately 0.15 to 0.5 (corresponding to 1-3 G or C) in the region are
selected; (2) next, siRNA sequence motifs having a G or C at the position corresponding to
position 1 of the corresponding 19mer siRNA and an A or T at the position corresponding to
position 19 of the corresponding 19mer siRNA are selected; (3) siRNAs having PSSM scores
in the range of -300 to 200 or closest to 0 are then selected; (4) siRNAs having fewer than 16
off-target BLAST matches are then selected; and (5) siRNA motifs in which 200 bases on
either side of the 19mer target region are not repeat or low-complexity sequences are
selected.
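
The five-step selection can be sketched as a chain of filters. This is a minimal sketch under stated assumptions: the GC ranking in step (1) is treated as a pass/fail filter, the PSSM score, BLAST match count and flank annotation are assumed to be precomputed, and the dictionary keys are hypothetical names introduced only for illustration.

```python
def gc_count(seq):
    """Number of G or C bases in seq."""
    return sum(base in "GC" for base in seq)


def select_candidates(candidates, score_range=(-300, 200), max_blast_hits=16):
    """Apply filters (1)-(5) in order; return survivors sorted by |PSSM score|.

    Each candidate is a dict with hypothetical keys:
      site          19mer target site (DNA or RNA letters)
      pssm_score    precomputed PSSM score
      blast_hits    precomputed number of off-target BLAST matches
      flank_repeat  True if the 200-base flanks contain repeat or
                    low-complexity sequence
    """
    low, high = score_range
    kept = []
    for cand in candidates:
        site = cand["site"].upper().replace("U", "T")
        if len(site) != 19:
            continue
        # (1) 1-3 G/C among positions 2-7 (1-based), i.e. roughly 0.15-0.5.
        if not 1 <= gc_count(site[1:7]) <= 3:
            continue
        # (2) G/C at position 1 and A/T at position 19.
        if site[0] not in "GC" or site[18] not in "AT":
            continue
        # (3) PSSM score inside the reference range.
        if not low <= cand["pssm_score"] <= high:
            continue
        # (4) Fewer than max_blast_hits off-target BLAST matches.
        if cand["blast_hits"] >= max_blast_hits:
            continue
        # (5) No repeat or low-complexity sequence in the flanks.
        if cand["flank_repeat"]:
            continue
        kept.append(cand)
    # Scores closest to 0 come first.
    return sorted(kept, key=lambda c: abs(c["pssm_score"]))
```

Sorting by absolute score reflects the "closest to 0" alternative in step (3); a different reference value could be substituted in the sort key.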

In another embodiment, a reference value or reference range for each of a plurality of
different abundance ranges is determined. Selection of siRNA functional motifs in a gene of
interest is achieved by using the appropriate reference value or reference range for the
abundance range in which the gene of interest falls. In one embodiment, the plurality of
different abundance ranges consists of two ranges: below about 3-5 copies per cell,
corresponding to poorly-expressed genes, and above 5 copies per cell, corresponding to
highly-expressed genes. The reference value or reference range can be determined for each
abundance range using any one of the methods described above.

In another embodiment, a plurality of siRNA functional motif profiles are determined
for a plurality of different transcript abundance ranges. Each such profile is determined
based on silencing efficacy data of siRNAs that target genes having expression levels in a
given range, i.e., genes whose transcript abundances fall within a given range, using a method
described in Sections 5.1.2 and 5.1.3., supra. In one embodiment, a set of one or more
PSSMs for genes having expression levels in a given range are trained as described in Section
5.1.2. using siRNAs that target genes having expression levels in the range. The PSSMs are
then used for identifying siRNA functional motifs in a target gene whose expression level
falls in the range, e.g., by ranking according to the PSSM scores obtained using a method
described in Section 5.1.1. In a preferred embodiment, the transcript abundance ranges are
divided into two ranges: below about 3-5 copies per cell, corresponding to poorly-expressed
genes, and above 5 copies per cell, corresponding to highly-expressed genes. Two sets of
PSSMs are obtained, one for each abundance range. siRNA functional motifs in a gene of
interest can be identified using the set of PSSMs that is appropriate for the abundance of the
gene of interest.

The invention also provides methods for evaluating the silencing efficacies of siRNA
sequence motifs under different siRNA concentrations. For example, the methods described
above for evaluating silencing efficacy of siRNA sequence motifs in transcripts having
different abundances can be used for such purposes by replacing the abundance parameter
with the concentration parameter. In one embodiment, a plurality of siRNA functional motif
profiles are determined for a plurality of different siRNA concentration ranges. Each such
profile can be determined based on silencing efficacy data of different concentrations of
siRNAs targeting genes having a different expression level or having an expression level in a
different range. In one embodiment, such profiles are determined for transcripts having a
given abundance or having an abundance within a range of abundances. Each such profile can
be determined based on silencing efficacy data of different concentrations of siRNAs targeting
genes having the expression level or having an expression level in the range. In one
embodiment, one or more PSSMs for a given siRNA concentration range are trained based on
silencing efficacy data of siRNAs having a concentration in the range. The PSSMs can then
be used for selecting siRNAs that have high efficiency at a concentration that falls in the
concentration range. In a preferred embodiment, the transcript abundance range is selected
to be below 5 copies per cell. In another embodiment, the transcript abundance range is
selected to be above 5 copies per cell. The invention thus provides a method for selecting
one or more siRNA functional motifs for targeting by siRNAs of a given concentration.

The methods can be used for identifying one or more siRNA functional motifs that 
can be targeted by siRNAs of a given concentration with desired silencing efficacy. The 
given concentration is preferably in the low nanomolar to sub-nanomolar range, more
preferably in the picomolar range. In specific embodiments, the given concentration is 50
nmol, 20 nmol, 10 nmol, 5 nmol, 1 nmol, 0.5 nmol, 0.1 nmol, 0.05 nmol, or 0.01 nmol. The
desired silencing efficacy is at least 50%, 75%, 90%, or 99% under a given concentration.
Such methods are particularly useful for designing therapeutic siRNAs. For therapeutic uses,
it is often desirable to identify siRNAs that can silence a target gene with high efficacy at
sub-nanomolar to picomolar concentrations. The invention thus also provides a method for
design of therapeutic siRNAs. 

The invention also provides a method for determining if a gene is suitable for targeting by a 
therapeutic siRNA. In one embodiment, the desired siRNA concentration and the desired 
silencing efficacy are first determined. A plurality of possible siRNA sequence motifs in the 
transcript of the gene is evaluated using a method of this invention. One or more siRNA 
sequence motifs that exhibit the highest efficacy, e.g., having PSSM scores satisfying the 
above described criterion or criteria, are identified. The gene is determined as suitable for 
targeting by a therapeutic siRNA if the one or more siRNA sequence motifs can be targeted 
by the corresponding siRNAs with silencing efficacy above or equal to the desired efficacy. 
In one embodiment, the plurality of possible siRNA sequence motifs comprises siRNA
sequence motifs that span or are tiled across a part of or the entire transcript at steps of a
predetermined base interval, e.g., at steps of 1, 5, 10, 15, or 19 base intervals. In a preferred
embodiment, successive overlapping siRNA sequence motifs are tiled across the entire
transcript sequence. In another preferred embodiment, successive overlapping siRNA
sequence motifs are tiled across a region of or the entire transcript sequence at steps of 1 base
intervals.

5.2. METHODS OF IDENTIFYING OFF-TARGET GENES OF AN siRNA

The invention also provides a method of identifying off-target genes of an siRNA. As
used herein, an "off-target" gene is a gene which is directly silenced by an siRNA that is
designed to target another gene (see International application No. PCT/US2004/015439 by
Jackson et al., filed on May 17, 2004, which is incorporated herein by reference in its
entirety). An off-target gene can be silenced by either the sense strand or the antisense strand
of the siRNA.

5.2.1. SEQUENCE MATCH PROFILE AND OFF-TARGET SILENCING

Microarray experiments suggest that most siRNA oligos result in downregulation of
off-target genes through direct interactions between an siRNA and the off-target transcripts.
While sequence similarity between dsRNA and transcripts appears to play a role in
determining which off-target genes are affected, sequence similarity searches, even
combined with thermodynamic models of hybridization, are insufficient to predict off-target
effects accurately. However, alignment of off-target transcripts with offending siRNA 
sequences reveals that some base pairing interactions between the two appear to be more 
important than others (Fig. 6). 

The invention provides a method of identifying potential off-target genes of an
siRNA using a PSSM that describes the sequence match pattern between an siRNA and a
sequence of an off-target gene (pmPSSM). In one embodiment, the sequence match pattern
is represented by weights of different positions in an siRNA to match the corresponding
target positions in off-target transcripts {P_i}, where P_i is the weight of a match at position i, i
= 1, 2, ..., L, where L is the length of the siRNA. Such a match pattern can be determined
based on the frequency with which each position in an siRNA is found to match affected
off-target transcripts identified as direct targets of the siRNA by simultaneous downregulation
with the intended target through kinetic analysis of expression profiles (see International
application No. PCT/US2004/015439 by Jackson et al., filed on May 17, 2004). A
pmPSSM can be {E_i}, where E_i = P_i if position i in the alignment is a match and E_i =
(1 - P_i)/3 if position i is a mismatch. An exemplary {P_i} for a 19mer siRNA sequence is plotted
in FIG. 7 and listed in Table I.

Table I Weights of an exemplary pmPSSM for 21nt siRNAs having a 19 nt duplex region

In one embodiment, sequence match patterns of off-target transcripts are used to
obtain a pmPSSM. Off-target genes of an siRNA can be identified using a method disclosed
in International application No. PCT/US2004/015439 by Jackson et al., filed on May 17,
2004, which is incorporated herein by reference in its entirety. For example, off-target genes
of an siRNA are identified based on silencing kinetics (see, e.g., International application
No. PCT/US2004/015439 by Jackson et al., filed on May 17, 2004). A pmPSSM can then
be generated using the frequency of matches found for each position. In one embodiment,
the alignment shown in Fig. 6 and similar data for other siRNAs were combined to generate
the exemplary position-specific scoring matrix for use in predicting off-target effects.

The degree of match between an siRNA and a sequence in a transcript can be
evaluated with the pmPSSM using a score (also referred to as a position match score,
pmScore) according to the following equation

$$\mathrm{Score} = \sum_{i=1}^{L} \ln(E_i / 0.25) \qquad (6)$$

where L is the length of the alignment, e.g., 19. A pmScore above a given threshold
identifies the sequence as a potential off-target sequence.
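
The position match score can be sketched as below, reading Eq. 6 as a sum of log-odds terms against a uniform background of 0.25, with a per-position term of P_i for a match and (1 - P_i)/3 for a mismatch. That reading of the mismatch term is an assumption, and any weights passed in are placeholders rather than the Table I values, which are not reproduced here. The two sequences are assumed to be already aligned without gaps and written in the same orientation.

```python
import math


def pm_score(sirna, target, match_weights, background=0.25):
    """Position match score for a gapless alignment.

    match_weights: P_i for each position, each strictly between 0 and 1
    (hypothetical values supplied by the caller).
    """
    if not (len(sirna) == len(target) == len(match_weights)):
        raise ValueError("siRNA, target and weights must have equal length")
    score = 0.0
    for s_base, t_base, p in zip(sirna, target, match_weights):
        # E_i = P_i on a match; the remaining mass is split over 3 mismatches.
        e = p if s_base == t_base else (1.0 - p) / 3.0
        score += math.log(e / background)
    return score
```

With every P_i equal to 0.25 a perfect match scores exactly 0, so positive scores indicate matching at positions weighted above background.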

The inventors have discovered that for a given siRNA the number of alignments with 
a score above a threshold is predictive of the number of observed off-target effects. The 
score threshold can be optimized by maximizing the correlation between predicted and 
observed numbers of off-target effects (Fig. 8). The optimized threshold can be used to favor
selection of siRNAs with relatively small numbers of predicted off-target effects. 
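
Threshold optimization of this kind can be sketched as a simple sweep: for each candidate threshold, count the alignments per siRNA that score above it and keep the threshold whose counts correlate best with the observed numbers of off-target effects. The grid of candidate thresholds and the data layout are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_threshold(scores_per_sirna, observed_counts, thresholds):
    """Return (threshold, r) maximizing correlation of predicted vs observed.

    scores_per_sirna: one list of alignment scores per siRNA.
    observed_counts: observed number of off-target effects per siRNA.
    """
    best = None
    for t in thresholds:
        predicted = [sum(s > t for s in scores) for scores in scores_per_sirna]
        r = pearson(predicted, observed_counts)
        if best is None or r > best[1]:
            best = (t, r)
    return best
```

In practice the threshold would be chosen on one set of siRNAs and checked on another, since a sweep like this can overfit a small data set.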

5.2.2. METHOD OF IDENTIFYING OFF-TARGET GENES OF AN siRNA

Off-target genes of a given siRNA can be identified by first identifying off-target
transcript sequences that align with the siRNA. Any suitable method for pair-wise
alignment, such as but not limited to BLAST and FASTA, can be used. The
position-specific scoring matrix is then used to calculate position match scores for these alignments.
In a preferred embodiment, alignments are established with a low-stringency FASTA search 
and the score for each alignment is calculated according to Eq. 6. A score above a given 
threshold identifies the transcript comprising the sequence as a potential off-target gene. 

The invention thus also provides a method of evaluating the silencing specificity of
an siRNA. In one embodiment, potential off-target genes of the siRNA are identified. The
total number of such off-target genes in the genome or a portion of the genome is then used
as a measure of the silencing specificity of the siRNA.

5.3. METHOD FOR PREDICTION OF STRAND PREFERENCE OF siRNAS 

The invention provides a method for predicting strand preference and/or the efficacy
and specificity of siRNAs based on position specific base composition of the siRNAs. The
inventors have discovered that an siRNA whose base composition PSSM score (see Section
5.1.) is greater than the base composition PSSM (G/C PSSM) score of its reverse complement
is predicted to have an antisense strand that is more active than its sense strand. In contrast,
an siRNA whose base composition PSSM score is less than the base composition PSSM
score of its reverse complement is predicted to have a sense strand that is more active than its
antisense strand.

It has been shown that increased efficacy of an siRNA in silencing a sense-identical
target gene corresponds to greater antisense strand activity and lesser sense strand activity.
The inventors have discovered that base composition PSSMs can be used to distinguish
siRNAs with strong sense strands as bad siRNAs from siRNAs with weak sense strands as
good siRNAs. The reverse complements of bad siRNAs were seen to be even more different
from the bad siRNAs themselves than are good siRNAs. On the average, the reverse
complements of bad siRNAs had even stronger G/C content at the 5' end than the good
siRNAs did and were similar in G/C content to good siRNAs at the 3' end. In contrast, the
reverse complements of good siRNAs were seen to be substantially more similar to bad
siRNAs than the good siRNAs were. On the average, the reverse complements of good
siRNAs hardly differed from bad siRNAs in G/C content at the 5' end and were only slightly
less G/C rich than bad siRNAs at the 3' end. These results indicate that the G/C PSSMs
distinguish siRNAs with strong sense strands as bad siRNAs from siRNAs with weak sense
strands as good siRNAs.

FIG. 14A shows the difference between the mean G/C content of the reverse
complements of bad siRNAs and the mean G/C content of the bad siRNAs themselves,
within the 19mer siRNA duplex region. The difference between the mean G/C content of
good and bad siRNAs is shown for comparison. The curves are smoothed over a window of
5 (or portion of a window of 5, at the edges of the sequence).

FIG. 14B shows the difference between the mean G/C content of the reverse
complements of good siRNAs and the mean G/C content of bad siRNAs, within the 19mer
siRNA duplex region. The difference between the mean G/C content of good and bad 
siRNAs is shown for comparison. The curves are smoothed over a window of 5 (or portion 
of a window of 3, at the edges of the sequence). 

In FIG. 15, siRNAs were binned by measured silencing efficacy, and the frequency of
sense-active calls by the 3'-biased method and G/C PSSM method was compared. Although
these techniques are based on different analyses, the agreement is quite good. Both show that
a higher proportion of low-silencing siRNAs vs. high-silencing siRNAs are predicted to be
sense active. The correlation coefficient for (siRNA G/C PSSM score - reverse complement
G/C PSSM score) vs. log10(sense-identity score/antisense-identity score) is 0.59 for the set of
61 siRNAs binned in FIG. 15.

Thus, in one embodiment, the invention provides a method for predicting strand 
preference, i.e., which of the two strands is more active, of siRNAs based on position 
specific base composition of the siRNAs. In one embodiment, the method comprises 
evaluating the strand preference of an siRNA in gene silencing by comparing the base 
compositions of the sense and the antisense strands of the siRNA. In another embodiment, 
the method comprises evaluating the strand preference of an siRNA in gene silencing by 
comparing the base compositions of the sense and the reverse complement of the target 
sequence of the siRNA. 

In one embodiment, the sequence of the antisense strand of an siRNA or the reverse 
complement of the target sequence of the siRNA in a transcript is compared with the target 
sequence using a PSSM approach (see Section 5.1.). An siRNA and its reverse complement 
are scored using a PSSM based on a smoothed G/C content difference between good and bad 
siRNAs within the duplex region as the weight matrix. In one embodiment, a base 
composition weight matrix as described by FIG. 14A is used as the weight matrix. In a 
preferred embodiment, the PSSM score of each strand can be calculated as the dot product of 
the siRNA strand G/C content with the G/C content difference matrix (as in the score 
calculation method of curve model PSSMs). In one embodiment, an siRNA is identified as 
sense-active if its reverse complement PSSM score exceeds its own PSSM score. 
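
The dot-product scoring and the sense-active call described above can be sketched as follows. This is a minimal illustration, not the inventors' implementation: the function names are hypothetical, and the weight vector used in the usage note is a made-up placeholder rather than the curve of FIG. 14A.

```python
# Minimal sketch of G/C PSSM strand-preference scoring.
# All names are hypothetical; weights must be supplied by the caller.

_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Reverse complement of an RNA sequence (T is treated as U)."""
    rna = seq.upper().replace("T", "U")
    return "".join(_COMPLEMENT[base] for base in reversed(rna))


def gc_indicator(seq):
    """1.0 at positions holding G or C, 0.0 elsewhere."""
    return [1.0 if base in "GC" else 0.0 for base in seq.upper()]


def pssm_score(seq, weights):
    """Dot product of the strand's G/C indicator with the weight vector."""
    if len(seq) != len(weights):
        raise ValueError("sequence and weight vector must have equal length")
    return sum(g * w for g, w in zip(gc_indicator(seq), weights))


def is_sense_active(sirna_seq, weights):
    """Call an siRNA sense-active when its reverse complement outscores it."""
    own = pssm_score(sirna_seq, weights)
    rc = pssm_score(reverse_complement(sirna_seq), weights)
    return rc > own
```

With an assumed weight vector that rewards G/C at the 5' end and penalizes it at the 3' end, e.g. `[0.1] * 5 + [0.0] * 9 + [-0.1] * 5`, a 19mer that is G/C rich only at its 3' end would be called sense-active, and one that is G/C rich only at its 5' end would not.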

In another embodiment, the 3'-biased method as described in International 
application No. PCT/US2004/015439 by Jackson et al., filed on May 17, 2004, which is 
incorporated herein by reference in its entirety, is used in conjunction with the PSSM score to 
determine the strand preference of an siRNA. In such an embodiment, an siRNA is 
identified as sense-active by the 3'-biased method of strand preference determination if the 
antisense-identical score exceeds the sense-identical score. 

The method based on comparison of G/C PSSMs of siRNAs and their reverse 
complements for prediction of strand bias was tested by comparison with estimation of strand 
bias from siRNA expression profiles by the 3'-biased method. 

The invention also provides a method for identifying siRNAs having good silencing 
efficacy. The method comprises identifying siRNAs having dominant antisense strand 
activity ("antisense-active" siRNAs) as siRNAs having good silencing efficacy and 
specificity (for silencing a sense-identical target). In one embodiment, the method described 
in Section 5.1. is used to identify siRNAs having a good sense strand (i.e., identifying siRNAs 
having good silencing efficacy towards an antisense-identical target). Such siRNAs are then 
eliminated from use in silencing sense-identical targets. The method can also be used to 
eliminate siRNAs with dominant sense strand activity ("sense-active" siRNAs) as siRNAs 
having less efficacy and specificity for silencing sense-identical targets. In one embodiment, 
the method described in International application No. PCT/US2004/015439 by Jackson et 
al., filed on May 17, 2004, which is incorporated herein by reference in its entirety, is used 
to determine strand preference of an siRNA. 

The reverse complements of bad siRNAs, on average, appear to have a GC 
content profile which differs from that of bad siRNAs in the same manner as the GC content 
profile of good siRNAs differs from that of bad siRNAs. However, the reverse complements 
of bad siRNAs show even more extreme differences from bad siRNAs than do the good 
siRNAs. 

This observation is in accord with the evidence in siRNA expression profiles that 
many bad siRNAs have active sense strands. 

The combination of data and analysis thus suggests that the reverse complements of 
bad siRNAs form an alternative, or perhaps even more advantageous, model for effective 
siRNAs than the good siRNAs do. Thus, the invention also provides a method for selecting 
siRNAs based on the base composition of the sequence of a reverse complement of the sense 
strand of the siRNAs. In one embodiment, a plurality of different siRNAs, each designed for 
silencing a target gene in an organism at a different target sequence in a transcript of the 
target gene, is ranked according to positional base composition of the reverse complement 
sequences of their sense strands. One or more siRNAs whose reverse complement 
sequences' positional base composition matches the positional base composition of desired 
siRNAs can then be selected. Preferably, the ranking of siRNAs is carried out by first 
determining a score for each different siRNA using a position-specific score matrix. The 
siRNAs are then ranked according to the score. Any method described in Section 5.1., supra, 
can be used to score reverse complement sequences. In one embodiment, for siRNAs that 
have a nucleotide sequence of L nucleotides in the duplex region, L being an integer, the 
position-specific score matrix comprises a difference in probability of finding nucleotide G or 
C at sequence position k between the reverse complement of a first type of siRNA and the reverse 
complement of a second type of siRNA, designated as w_k, k = 1, ..., L. The score for each 
reverse complement is calculated according to the equation 

$$\mathrm{Score} = \sum_{k=1}^{L} w_k \qquad (7)$$

The first type of siRNA can consist of one or more siRNAs having silencing efficacy no less 
than a first threshold, e.g., 75%, 80% or 90% at a suitable dose, e.g., 100 nM, and the second 
type of siRNA can consist of one or more siRNAs having silencing efficacy less than a 
second threshold, e.g., 25%, 50%, or 75% at a suitable dose, e.g., 100 nM. In a preferred 
embodiment, the difference in probability is described by a sum of Gaussian curves, each of 
said Gaussian curves representing the difference in probability of finding a G or C at a 
different sequence position. 
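
As a rough illustration of the scoring and ranking just described, the sketch below builds position weights from a sum of Gaussian curves and ranks siRNAs by the score of the reverse complement of their sense strands. It rests on several assumptions that are not stated in the text: the Gaussian parameters (amplitude, center, width) are hypothetical, the function names are invented for the example, and the sum in the score is read as running over positions that hold a G or C (the dot-product reading given for curve model PSSMs).

```python
import math

# Sketch only: Gaussian parameters and function names are assumptions.

_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Reverse complement of an RNA sequence (T is treated as U)."""
    rna = seq.upper().replace("T", "U")
    return "".join(_COMPLEMENT[base] for base in reversed(rna))


def gaussian_weights(length, curves):
    """Weights w_k for k = 1..length as a sum of Gaussian curves.

    curves is a list of (amplitude, center, width) tuples.
    """
    weights = []
    for k in range(1, length + 1):
        w = sum(a * math.exp(-((k - mu) ** 2) / (2.0 * sigma ** 2))
                for a, mu, sigma in curves)
        weights.append(w)
    return weights


def gc_score(seq, weights):
    """Sum of w_k over positions k that hold G or C."""
    return sum(w for base, w in zip(seq.upper(), weights) if base in "GC")


def rank_by_reverse_complement(sense_seqs, weights):
    """Rank siRNAs, best first, by the score of their reverse complements."""
    return sorted(sense_seqs,
                  key=lambda s: gc_score(reverse_complement(s), weights),
                  reverse=True)
```

For example, `gaussian_weights(19, [(0.2, 1, 2.0), (-0.2, 19, 2.0)])` gives a profile that is positive near position 1 and negative near position 19; any real profile would have to be fitted to measured good and bad siRNA sets.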

The methods of this invention can also be applied to developing models, e.g., PSSMs, 
of siRNA functional motifs by training position-specific scoring matrices to distinguish 
between bad siRNAs and their reverse complements (see, e.g., Section 5.1.). A restriction in 
this analysis is that the reverse complements of bad siRNAs have no designated targets. 
Thus, in one embodiment, position-specific scoring matrices of 19mer siRNA duplex 
sequences are trained to distinguish between bad siRNAs and their reverse complements. 

Flanking sequence training can be performed on off-target genes in the case of 
distinguishing between bad siRNAs and their reverse complements, as well as in the case of 
distinguishing between any two groups of siRNAs. In other words, the off-target activity of 
siRNAs can be hypothesized to have the same flanking sequence requirements as the 
on-target activity, as the same RNA-protein complexes are thought to be involved in both 
processes. 

Thus, if the methods of the off-target application are used to identify genes directly 
down-regulated by an siRNA (i.e., through kinetic analysis of down-regulation to identify a 
group of genes down-regulated with the same half-life as the intended target), the regions 
flanking the alignment of the siRNA with the directly regulated off-target genes can be used 
to train and test models of flanking sequence requirements. These models can be developed 
by any of the methods of this invention: random hill-climbing PSSMs, curve-model PSSMs, 
good-bad difference frequency matrices, good-composition frequency matrices, and/or 
bad-composition frequency matrices, etc. 

5.4. METHODS OF DESIGNING siRNAS FOR GENE SILENCING 

The invention provides a method for designing siRNAs for gene silencing. The 
method can be used to design siRNAs that have full sequence homology to their respective 
target sequences in a target gene. The method can also be used to design siRNAs that have 
only partial sequence homology to a target gene. Methods and compositions for silencing a 
target gene using an siRNA that has only partial sequence homology to its target sequence in 
a target gene are disclosed in International application No. PCT/US2004/015439 by Jackson et 
al., filed on May 17, 2004, which is incorporated herein by reference in its entirety. For 
example, an siRNA that comprises a sense strand contiguous nucleotide sequence of 11-18 
nucleotides that is identical to a sequence of a transcript of the target gene, but which 
does not have full length homology to any sequences in the transcript, may be used to silence 
the transcript. Such contiguous nucleotide sequence is preferably in the central region of the 
siRNA molecules. A contiguous nucleotide sequence in the central region of an siRNA can 
be any continuous stretch of nucleotide sequence in the siRNA which does not begin at the 3' 
end. For example, a contiguous nucleotide sequence of 11 nucleotides can be the nucleotide 
sequence 2-12, 3-13, 4-14, 5-15, 6-16, 7-17, 8-18, or 9-19. In preferred embodiments, the 
contiguous nucleotide sequence is 11-16, 11-15, 14-15, 11, 12, or 13 nucleotides in length. 
Alternatively, an siRNA that comprises a 3' sense strand contiguous nucleotide sequence of 
9-18 nucleotides which is identical to a sequence of a transcript of the target gene, but which 
siRNA does not have full length sequence identity to any contiguous sequences in the 
transcript, may also be used to silence the transcript. A 3' 9-18 nucleotide sequence is a 
continuous stretch of nucleotides that begins at the first paired base, i.e., it does not comprise 
the two base 3' overhang. In preferred embodiments, the contiguous nucleotide sequence is 
9-16, 9-15, 9-12, 11, 10, or 9 nucleotides in length. 

In preferred embodiments, the method of Section 5.1 is used for identifying from 
among a plurality of siRNAs one or more siRNAs that have high silencing efficacy. In one 
embodiment, each siRNA in the plurality of siRNAs is evaluated for silencing efficacy by 
base composition PSSMs. In one embodiment, this step comprises calculating one or more 
PSSM scores for each siRNA. The plurality of siRNAs are then ranked based on the score, 
and one or more siRNAs are selected using a method described in Section 5.1.4. 

In other preferred embodiments, the method of Section 5.2 is used for identifying 
from among a plurality of siRNAs one or more siRNAs that have high silencing specificity. 
In one embodiment, alignments of each siRNA with sequences in each of a plurality of 
non-target transcripts are identified and evaluated with the pmPSSM approach (see Section 5.2.). 
A pmScore is calculated for each of the alignments. A pmScore above a given threshold 
identifies a sequence as a potential off-target sequence. Such a pmScore is also termed an 
alignment score. For example, when FASTA is used for the alignment, a pmScore can be a 
weighted FASTA alignment score. The transcript that comprises the potential off-target 
sequence is identified as a potential off-target transcript. The total number of such off-target 
transcripts in the genome or a portion of the genome is used as a measure of the silencing 
specificity of the siRNA. One or more siRNAs having fewer off-target transcripts may then be 
selected. 
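
The counting step described above can be sketched as follows. The sketch assumes the pmScores have already been computed by the method of Section 5.2 and are supplied as (transcript identifier, pmScore) pairs; the function names, the data layout, and the tie-breaking by siRNA identifier are all illustrative assumptions.

```python
# Sketch: count potential off-target transcripts per siRNA from
# precomputed alignment scores. Names and data layout are hypothetical.

def count_off_target_transcripts(alignments, threshold):
    """Number of distinct transcripts with at least one pmScore above threshold.

    alignments is an iterable of (transcript_id, pm_score) pairs.
    """
    hits = {transcript for transcript, score in alignments if score > threshold}
    return len(hits)


def select_most_specific(alignments_by_sirna, threshold, n_keep):
    """Return the n_keep siRNA ids with the fewest off-target transcripts."""
    counts = {
        sirna: count_off_target_transcripts(alignments, threshold)
        for sirna, alignments in alignments_by_sirna.items()
    }
    # Ties are broken by siRNA id so the result is deterministic.
    return sorted(counts, key=lambda sirna: (counts[sirna], sirna))[:n_keep]
```

Several alignments to the same transcript are counted once, since the specificity measure is the number of off-target transcripts rather than the number of alignments.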

The siRNAs having the desired levels of efficacy and specificity for a transcript can 
be further evaluated for sequence diversity. In this disclosure, sequence diversity is also 
referred to as "sequence variety" or simply "diversity" or "variety." Sequence diversity can 
be represented or measured based on some sequence characteristics. The siRNAs can be 
selected such that a plurality of siRNAs targeting a gene comprises siRNAs exhibiting 
sufficient difference in one or more of such diversity characteristics. 

Preferably the sequence diversity characteristics used in the method of the invention 
are quantifiable. For example, sequence diversity can be measured based on GC content, the 
location of the siRNA target sequence along the length of the target transcript, or the two 
bases upstream of the siRNA duplex (i.e., the leading dimer, with 16 different possible 
leading dimers). The difference of two siRNAs can be measured as the difference between 
values of a sequence diversity measure. The diversity or variety of a plurality of siRNAs can 
be quantitatively represented by the minimum difference or spacing in a sequence diversity 
measure between different siRNAs in the plurality. 

In the siRNA design method of the invention, the step of selection of siRNAs for 
diversity or variety is also referred to as a "de-overlap" step. In a preferred embodiment, for 
a sequence diversity measure that is quantifiable, the de-overlapping selects siRNAs having 
differences of a sequence diversity measure between two siRNAs above a given threshold. 
For example, de-overlapping by position establishes a minimum distance between selected 
oligos along the length of the transcript sequence. In one embodiment, siRNAs positioned at 
least 100 bases apart in the transcript are selected. De-overlapping by GC content establishes 
a minimum difference in GC content. In one embodiment, the minimum difference in GC 
content is 1%, 2% or 5%. De-overlapping by leading dimers establishes the probability of all 
or a portion of the 16 possible leading dimers among the selected siRNAs. In one 
embodiment, each of the 16 possible dimers is assigned a score of 1-16, and a value of 0.5 is 
used to select all possible leading dimers with equal probability. 

In some embodiments, the candidates are preferably de-overlapped on GC content, 
with a minimum spacing of 5%, a maximum number of duplicates of each value of GC% of 
100 and at least 200 candidates selected; more preferably they are de-overlapped on GC 
content with a minimum spacing of 5%, a maximum number of duplicates of each value of 
GC% of 80 and at least 200 candidates selected; and still more preferably they are 
de-overlapped on GC content with a minimum spacing of 5%, a maximum number of duplicates 
of each value of GC% of 60 and at least 200 candidates selected. 

siRNAs can be further selected based on additional selection criteria. 

In one embodiment, siRNAs targeting sequences not common to all documented 
splice forms are eliminated. 

In another embodiment, siRNAs targeting sequences overlapping with simple or 
interspersed repeat elements are eliminated. 

In still another embodiment, siRNAs targeting sequences positioned at least 75 bases 
downstream of the translation start codon are selected. 

In another embodiment, siRNAs targeting sequences overlapping or downstream of 
the stop codon are eliminated. This avoids targeting sequences absent in undocumented 
alternative polyadenylation forms. 

In still another embodiment, siRNAs with GC content close to 50% are selected. In 
one embodiment, siRNAs with GC% < 20% or > 70% are eliminated. In another 
embodiment, siRNAs with 10% < GC% < 90%, 20% < GC% < 80%, 25% < GC% < 75%, or 
30% < GC% < 70% are retained. 
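
A GC-content screen of this kind can be sketched in a few lines. The function names are hypothetical, and the default bounds correspond to the 20% and 70% example given above; whether the bounds themselves are retained is not specified in the text, so the inclusive comparison here is an assumption.

```python
# Sketch of a GC-content screen; names and inclusive bounds are assumptions.

def gc_percent(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * sum(1 for base in seq if base in "GC") / len(seq)


def passes_gc_filter(seq, low=20.0, high=70.0):
    """Keep sequences whose GC% lies within [low, high]."""
    return low <= gc_percent(seq) <= high
```

For a 19mer the GC% can take only 20 distinct values (0/19 through 19/19), which is why GC content is treated elsewhere in this section as a few-valued parameter for which duplicates are allowed.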

In still another embodiment, siRNAs targeting a sequence containing 4 consecutive 
guanosine, cytosine, adenine or uracil residues are eliminated. In still another embodiment, 
siRNAs targeting a sequence with a guanine or cytosine residue at the first position in the 
19mer duplex region at the 5' end are selected. Such siRNAs target sequences that are 
effectively transcribed by RNA polymerase III. 

In still another embodiment, siRNAs targeting a sequence containing recognition 
sites for one or more given restriction endonucleases, e.g., XhoI or EcoRI restriction 
endonucleases, are eliminated. This embodiment may be used to select siRNA sequences 
for construction of the shRNA vectors. 

In still another embodiment, the siRNAs are evaluated for binding energy. See WO 
01/05935 for an exemplary method of determining binding energy. In a preferred 
embodiment, the binding energy is evaluated by calculating the nearest-neighbor 21mer ΔG. 

In still another embodiment, the siRNAs are evaluated for binding specificity. See 
WO 01/05935 for an exemplary method of determining binding specificity of a 21mer. In a 
preferred embodiment, the binding specificity is evaluated by calculating a 21mer minimax 
score against the set of unique sequence representatives of genes of an organism, e.g., the set 
of unique sequence representatives for each cluster of Homo sapiens Unigene build 161 
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene). 

In still another embodiment, the method for predicting strand preference and/or the 
efficacy and specificity of siRNAs based on position specific base composition of the 
siRNAs as described in Section 5.3. can be used to evaluate the siRNA candidates. 

A flow chart of an exemplary embodiment of the method used to select the siRNAs is 
shown in FIG. 9. 

In step 101, siRNA sequences that target a transcript are selected. In one 
embodiment, all 19mer subsequences of the transcript are considered. The appropriate 
flanking sequences for each siRNA sequence are also obtained and considered. The siRNAs 
are evaluated against the following filters: (1) eliminating siRNAs targeting sequences not 
common to all documented splice forms; (2) eliminating siRNAs targeting sequences 
overlapping with simple or interspersed repeat elements; (3) eliminating siRNAs targeting 
sequences positioned within 75 bases downstream of the translation start codon; and (4) 
eliminating siRNAs overlapping or downstream of the stop codon. 

For shRNA selection, the following steps are also taken: (5) eliminating siRNAs 
targeting a sequence containing 4 consecutive guanosine, cytosine, adenine or uracil residues; 
(6) retaining siRNAs targeting a sequence with a guanine or cytosine residue at the first 
position in the 19mer duplex region at the 5' end; and (7) eliminating siRNAs targeting a 
sequence containing recognition sites for one or more given restriction enzymes, e.g., XhoI 
or EcoRI restriction endonucleases, if the siRNA sequences are used in construction of the shRNA 
vectors. 
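
Filters (5) through (7) operate on the target sequence alone and can be sketched as below. The function and variable names are hypothetical. The recognition sequences used for XhoI (CTCGAG) and EcoRI (GAATTC) are the standard sites for those enzymes, and the sketch converts the target to the DNA alphabet before searching, which is an implementation choice made here for illustration.

```python
import re

# Sketch of shRNA-specific sequence filters (5)-(7); names are hypothetical.

RESTRICTION_SITES = {"XhoI": "CTCGAG", "EcoRI": "GAATTC"}

# Any base repeated four times in a row.
_HOMOPOLYMER_RUN = re.compile(r"([ACGT])\1{3}")


def passes_shrna_filters(target, sites=RESTRICTION_SITES):
    """Apply filters (5)-(7) to a 19mer target sequence (RNA or DNA letters)."""
    dna = target.upper().replace("U", "T")
    # (5) eliminate targets with 4 consecutive identical residues
    if _HOMOPOLYMER_RUN.search(dna):
        return False
    # (6) retain only targets with G or C at the first (5') position
    if dna[0] not in "GC":
        return False
    # (7) eliminate targets containing a listed restriction site
    if any(site in dna for site in sites.values()):
        return False
    return True
```

Filters (1) through (4) depend on transcript annotation (splice forms, repeat elements, start and stop codon positions) and are therefore not covered by this sequence-only sketch.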

In step 102, the siRNA is evaluated for silencing efficacy by base composition 
PSSMs. In one embodiment, step 102 comprises calculating a first PSSM score, i.e., the 
PSSM-1 score, and a second PSSM score, i.e., the PSSM-2 score, for an siRNA. The two 
scores are summed to calculate the combined PSSM-1 + PSSM-2 score for the siRNA. In one 
embodiment, the PSSMs used are those whose performance is shown in Figure 2. The 
siRNA is retained if the combined score is above a given threshold. 

The siRNA is then evaluated for its binding energy by calculating the 
nearest-neighbor 21mer ΔG. The siRNA is then evaluated for binding specificity by calculating a 
21mer minimax score against the set of unique sequence representatives of genes of an 
organism, e.g., the set of unique sequence representatives for each cluster of Homo sapiens 
Unigene build 161. See WO 01/05935 for methods of calculating the ΔG and the minimax 
score. In one embodiment, the parameters for the BLAST alignments and nearest-neighbor 
delta-G calculations based on the BLAST alignments, which are used to compute minimax 
scores, are as follows: -p blastn -e 100 -F F -W 11 -b 200 -v 10000 -S 3; and delta-G: 
temperature 66°; salt 1 M; concentration 1 pM; type of nucleic acid, RNA. In one 
embodiment, the siRNA is eliminated if (21mer ΔG - 21mer minimax) < 0.5. 

In step 103, siRNAs are screened for overall GC content. In one embodiment, 
siRNAs with GC content significantly deviating from 50%, e.g., GC% < 20% or > 70%, are 
eliminated. 

In step 104, siRNAs are screened for diversity or variety. Position simply refers to 
the position of the oligo in the transcript sequence and is automatically provided by 
identifying the oligo. Variety is enforced in one or more "de-overlap" steps in the method. 
Briefly, de-overlapping selects for above-threshold spacing between selected oligos in some 
calculable parameter. To de-overlap, oligos are first ranked according to some parameter 
thought to distinguish better from poorer performers and then selected for spacing between 
oligos according to some other parameter. To begin, the top ranked oligo is selected. Then 
the ranked list is examined, and the next-best oligo with at least the minimum required 
spacing from the selected oligo is selected. Then the next-best oligo with at least the 
minimum spacing from the two selected oligos is also selected. The process continues until 
the desired number of oligos is selected. In one embodiment, multiple oligos may share the 
same value if a parameter is few-valued, and the number of oligos sharing the same value is 
limited by a set threshold. In one embodiment, if an insufficient number of oligos is selected 
in a first pass of de-overlapping, the spacing requirement can be relaxed until the desired 
number, or the set of all remaining available oligos, is selected. 
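
The greedy de-overlap procedure described above can be sketched as follows. This is an illustrative reading, not the inventors' code: the function names, the halving schedule used to relax the spacing, and the choice to restart the pass after each relaxation are all assumptions.

```python
# Sketch of greedy de-overlapping; names and relaxation schedule are assumptions.

def de_overlap(oligos, score, value, min_spacing, n_wanted, max_duplicates=1):
    """Select oligos best-first, enforcing spacing in a second parameter.

    score(oligo) ranks candidates (higher is better); value(oligo) is the
    parameter in which selected oligos must be spaced. Oligos sharing exactly
    the same value are allowed up to max_duplicates times.
    """
    ranked = sorted(oligos, key=score, reverse=True)
    selected = []
    for oligo in ranked:
        if len(selected) >= n_wanted:
            break
        v = value(oligo)
        duplicates = sum(1 for s in selected if value(s) == v)
        spaced = all(value(s) == v or abs(value(s) - v) >= min_spacing
                     for s in selected)
        if spaced and duplicates < max_duplicates:
            selected.append(oligo)
    return selected


def de_overlap_relaxed(oligos, score, value, min_spacing, n_wanted,
                       max_duplicates=1, relax=0.5, floor=1e-6):
    """Repeat de_overlap with a progressively smaller spacing if too few pass."""
    spacing = min_spacing
    selected = de_overlap(oligos, score, value, spacing, n_wanted, max_duplicates)
    while len(selected) < n_wanted and spacing > floor:
        spacing *= relax
        selected = de_overlap(oligos, score, value, spacing, n_wanted,
                              max_duplicates)
    return selected
```

As a usage example, with candidates given as (name, PSSM score, transcript position) tuples, `de_overlap(cands, score=lambda o: o[1], value=lambda o: o[2], min_spacing=100, n_wanted=3)` would de-overlap by position with a 100-base minimum distance; de-overlapping by GC% would pass a GC-content function as `value` and a larger `max_duplicates`.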

For example, de-overlapping by position establishes a minimum distance between 
selected oligos along the length of the transcript sequence. In one embodiment, siRNAs are 
ranked by a PSSM score and the ranked siRNAs positioned at least 100 bases apart in the 
transcript are selected. De-overlapping by GC content establishes a minimum difference in 
GC content. In one embodiment, the minimum difference in GC content is 1%, 2% or 5%. 
Duplicates are allowed for few-valued parameters such as the GC% of a 19mer. 
De-overlapping by leading dimers establishes the probability of all or a portion of the 16 possible 
leading dimers among the selected siRNAs. In one embodiment, each of the 16 possible 
dimers is assigned a score of 1-16, and a value of 0.5 is used to select all possible leading dimers 
with equal probability, i.e., to distribute candidate siRNAs over all possible leading dimer 
values. 

De-overlapping with different parameters may be combined. 

In step 105, off-target activity of an siRNA is evaluated according to the method 
described in Section 5.2. Alignments of each siRNA with sequences in each of a plurality of 
non-target transcripts are identified and evaluated with a pmPSSM using a pmScore 
calculated according to equation (6). A pmScore above a given threshold identifies the 
sequence as a potential off-target sequence. The transcript that comprises the potential 
off-target sequence is identified as a potential off-target transcript. The total number of such 
off-target transcripts in the genome or a portion of the genome is used as a measure of the 
silencing specificity of the siRNA. One or more siRNAs having fewer off-target transcripts are 
selected. 

In one embodiment, transcripts of genes are scanned using FASTA with the 
parameters: KTUP 6 -r 3/-7 -g -6 -f -6 -d 14000 -b 14000 -E 7000. A pmScore is determined 
for each alignment as described in Section 5.2. The FASTA weighted score is used to: (1) 
quantify the nearest sequence match to the candidate siRNA; and (2) count the total matches 
to the candidate siRNA with weighted scores over a threshold. The total number of such 
off-target genes in the genome or a portion of the genome is then used as a measure of the 
silencing specificity of the siRNA. 

In a preferred embodiment, the selected siRNAs are subjected to a second round of 
selection for variety (step 106), and re-ranked by their base composition PSSM scores (step 
107). The desired number of siRNAs is retained from the top of this final ranking (step 108). 

The invention also provides a method for selecting a plurality of siRNAs for each of a 
plurality of different genes, each siRNA achieving at least 75%, at least 80%, or at least 90% 
silencing of its target gene. The method described above is used to select a plurality of 
siRNAs for each of a plurality of genes. Preferably, the plurality of siRNAs consists of at 
least 3, 5, or 10 siRNAs. Preferably, the plurality of different genes consists of at least 100, 
500, 1,000, 5,000, 10,000 or 30,000 different genes. 

The invention also provides a library of siRNAs which comprises a plurality of 
siRNAs for each of a plurality of different genes, each siRNA achieving at least 75%, at least 
80%, or at least 90% silencing of its target gene. The standard conditions are 100 nM siRNA, 
silencing assayed by TaqMan 24 hours post-transfection. Preferably, the plurality of siRNAs 
consists of at least 3, at least 5, or at least 10 siRNAs. Preferably, the plurality of different 
genes consists of at least 10, 100, 500, 1,000, 5,000, 10,000 or 30,000 different genes. 

5.5. METHODS AND COMPOSITIONS FOR RNA INTERFERENCE AND CELL ASSAYS 

Any standard method for gene silencing can be used in conjunction with the present 
invention, e.g., to carry out gene silencing using siRNAs designed by a method described in 
the present invention (see, e.g., Guo et al., 1995, Cell 81:611-620; Fire et al., 1998, Nature 
391:806-811; Grant, 1999, Cell 96:303-306; Tabara et al., 1999, Cell 99:123-132; Zamore et 
al., 2000, Cell 101:25-33; Bass, 2000, Cell 101:235-238; Petcherski et al., 2000, Nature 
405:364-368; Elbashir et al., Nature 411:494-498; Paddison et al., Proc. Natl. Acad. Sci. 
USA 99:1443-1448). In one embodiment, gene silencing is induced by presenting the cell 
with the siRNA, mimicking the product of Dicer cleavage (see, e.g., Elbashir et al., 2001, 
Nature 411, 494-498; Elbashir et al., 2001, Genes Dev. 15, 188-200, all of which are 
incorporated by reference herein in their entirety). Synthetic siRNA duplexes maintain the 
ability to associate with RISC and direct silencing of mRNA transcripts. siRNAs can be 
chemically synthesized, or derived from cleavage of double-stranded RNA by recombinant 
Dicer. Cells can be transfected with the siRNA using standard methods known in the art. 

In one embodiment, siRNA transfection is carried out as follows: one day prior to 
transfection, 100 microliters of chosen cells, e.g., cervical cancer HeLa cells (ATCC, Cat. 
No. CCL-2), grown in DMEM/10% fetal bovine serum (Invitrogen, Carlsbad, CA) to 
approximately 90% confluency are seeded in a 96-well tissue culture plate (Corning, 
Corning, NY) at 1500 cells/well. For each transfection, 85 microliters of OptiMEM 
(Invitrogen) is mixed with 5 microliters of serially diluted siRNA (Dharmacon, Denver) from a 
20 micromolar stock. For each transfection, 5 microliters of OptiMEM is mixed with 5 
microliters of Oligofectamine reagent (Invitrogen) and incubated 5 minutes at room temperature. 
The 10 microliter OptiMEM/Oligofectamine mixture is dispensed into each tube with the 
OptiMEM/siRNA mixture, mixed and incubated 15-20 minutes at room temperature. 10 
microliters of the transfection mixture is aliquoted into each well of the 96-well plate and 
incubated for 4 hours at 37°C and 5% CO2. 

In one embodiment, RNA interference is carried out using a pool of siRNAs. In a 
preferred embodiment, an siRNA pool containing at least k (k = 2, 3, 4, 5, 6 or 10) different 
siRNAs targeting a target gene at different sequence regions is used to transfect the cells. In 
another preferred embodiment, an siRNA pool containing at least k (k = 2, 3, 4, 5, 6 or 10) 
different siRNAs targeting two or more different target genes is used to supertransfect the 
cells. In a preferred embodiment, the total siRNA concentration of the pool is about the same 
as the concentration of a single siRNA when used individually, e.g., 100 nM. Preferably, the 
total concentration of the pool of siRNAs is an optimal concentration for silencing the 
intended target gene. An optimal concentration is a concentration further increase of which 
does not increase the level of silencing substantially. In one embodiment, the optimal 
concentration is a concentration further increase of which does not increase the level of 
silencing by more than 5%, 10% or 20%. In a preferred embodiment, the composition of the 
pool, including the number of different siRNAs in the pool and the concentration of each 
different siRNA, is chosen such that the pool of siRNAs causes less than 30%, 20%, 10%, 
5%, 1%, 0.1% or 0.01% of silencing of any off-target genes. In another preferred 
embodiment, the concentration of each different siRNA in the pool of different siRNAs is 
about the same. In still another preferred embodiment, the respective concentrations of 
different siRNAs in the pool are different from each other by less than 5%, 10%, 20% or 
50%. In still another preferred embodiment, at least one siRNA in the pool of different 
siRNAs constitutes more than 90%, 80%, 70%, 50%, or 20% of the total siRNA 
concentration in the pool. In still another preferred embodiment, none of the siRNAs in the 
pool of different siRNAs constitutes more than 90%, 80%, 70%, 50%, or 20% of the total 
siRNA concentration in the pool. In other embodiments, each siRNA in the pool has a 
concentration that is lower than the optimal concentration when used individually. In a 
preferred embodiment, each different siRNA in the pool has a concentration that is lower 
than the concentration of the siRNA that is effective to achieve at least 30%, 50%, 75%, 80%, 
85%, 90% or 95% silencing when used in the absence of other siRNAs or in the absence of 
other siRNAs designed to silence the gene. In another preferred embodiment, each different 
siRNA in the pool has a concentration that causes less than 30%, 20%, 10% or 5% of 
silencing of the gene when used in the absence of other siRNAs or in the absence of other 
siRNAs designed to silence the gene. In a preferred embodiment, each siRNA has a 
concentration that causes less than 30%, 20%, 10% or 5% of silencing of the target gene 
when used alone, while the plurality of siRNAs causes at least 80% or 90% of silencing of 
the target gene. 
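
The pool-composition conditions above are simple arithmetic on per-siRNA concentrations, which the following sketch illustrates. The function names are hypothetical, and the way the relative difference between concentrations is measured (spread divided by the largest concentration) is one possible reading chosen for the example, not a definition taken from the text.

```python
# Sketch of pool-composition arithmetic; names and the relative-difference
# definition are assumptions made for illustration.

def equal_pool(sirna_ids, total_nM=100.0):
    """Split a total pool concentration equally among the listed siRNAs."""
    per_sirna = total_nM / len(sirna_ids)
    return {sirna: per_sirna for sirna in sirna_ids}


def max_fraction(pool):
    """Largest share of the total concentration held by any single siRNA."""
    total = sum(pool.values())
    return max(pool.values()) / total


def max_relative_difference(pool):
    """Spread between highest and lowest concentration, relative to the highest."""
    concentrations = list(pool.values())
    return (max(concentrations) - min(concentrations)) / max(concentrations)
```

For a pool of four siRNAs at a total of 100 nM, `equal_pool` assigns 25 nM to each, so no siRNA constitutes more than 25% of the total and the relative difference between concentrations is zero.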

Another method for gene silencing is to introduce into a cell an shRNA, for short 
hairpin RNA (see, e.g., Paddison et al., 2002, Genes Dev. 16, 948-958; Brummelkamp et al., 
2002, Science 296, 550-553; Sui, G. et al. 2002, Proc. Natl. Acad. Sci. USA 99, 5515-5520, 
all of which are incorporated by reference herein in their entirety), which can be processed in 
the cells into siRNA. In this method, a desired siRNA sequence is expressed from a plasmid 
(or virus) as an inverted repeat with an intervening loop sequence to form a hairpin structure. 
The resulting RNA transcript containing the hairpin is subsequently processed by Dicer to 
produce siRNAs for silencing. Plasmid-based shRNAs can be expressed stably in cells, 
allowing long-term gene silencing in cells both in vitro and in vivo, e.g., in animals (see, 
McCaffrey et al. 2002, Nature 418, 38-39; Xia et al., 2002, Nat. Biotech. 20, 1006-1010; 
Lewis et al., 2002, Nat. Genetics 32, 107-108; Rubinson et al., 2003, Nat. Genetics 33, 
401-406; Tiscornia et al., 2003, Proc. Natl. Acad. Sci. USA 100, 1844-1848, all of which are 
incorporated by reference herein in their entirety). Thus, in one embodiment, a 
plasmid-based shRNA is used. 

In a preferred embodiment, shRNAs are expressed from recombinant vectors that are
either introduced transiently or stably integrated into the genome (see, e.g., Paddison et al.,
2002, Genes Dev 16:948-958; Sui et al., 2002, Proc Natl Acad Sci USA 99:5515-5520; Yu
et al., 2002, Proc Natl Acad Sci USA 99:6047-6052; Miyagishi et al., 2002, Nat Biotechnol
20:497-500; Paul et al., 2002, Nat Biotechnol 20:505-508; Kwak et al., 2003, J Pharmacol
Sci 93:214-217; Brummelkamp et al., 2002, Science 296:550-553; Boden et al., 2003,
Nucleic Acids Res 31:5033-5038; Kawasaki et al., 2003, Nucleic Acids Res 31:700-707). The
siRNA that disrupts the target gene can be expressed (via an shRNA) by any suitable vector
which encodes the shRNA. The vector can also encode a marker which can be used for
selecting clones in which the vector or a sufficient portion thereof is integrated in the host
genome such that the shRNA is expressed. Any standard method known in the art can be
used to deliver the vector into the cells. In one embodiment, cells expressing the shRNA are
generated by transfecting suitable cells with a plasmid containing the vector. Cells can then
be selected by the appropriate marker. Clones are then picked and tested for knockdown. In
a preferred embodiment, a plurality of recombinant vectors are introduced into the genome
such that the expression level of the siRNA can be above a given value. Such an embodiment
is particularly useful for silencing genes whose transcript level is low in the cell.

In a preferred embodiment, the expression of the shRNA is under the control of an
inducible promoter such that the silencing of its target gene can be turned on when desired.
Inducible expression of an siRNA is particularly useful for targeting essential genes. In one
embodiment, the expression of the shRNA is under the control of a regulated promoter that
allows tuning of the silencing level of the target gene. This allows screening against cells in
which the target gene is partially knocked out. As used herein, a "regulated promoter" refers
to a promoter that can be activated when an appropriate inducing agent is present. An
"inducing agent" can be any molecule that can be used to activate transcription by activating
the regulated promoter. An inducing agent can be, but is not limited to, a peptide or
polypeptide, a hormone, or an organic small molecule. An analogue of an inducing agent,






i.e., a molecule that activates the regulated promoter as the inducing agent does, can also be
used. The level of activity of the regulated promoter induced by different analogues may be
different, thus allowing more flexibility in tuning the activity level of the regulated promoter.
The regulated promoter in the vector can be any mammalian transcription regulation system
known in the art (see, e.g., Gossen et al., 1995, Science 268:1766-1769; Lucas et al., 1992,
Annu. Rev. Biochem. 61:1131; Li et al., 1996, Cell 85:319-329; Saez et al., 2000, Proc. Natl.
Acad. Sci. USA 97:14512-14517; and Pollock et al., 2000, Proc. Natl. Acad. Sci. USA
97:13221-13226). In preferred embodiments, the regulated promoter is regulated in a dosage
and/or analogue dependent manner. In one embodiment, the level of activity of the regulated
promoter is tuned to a desired level by a method comprising adjusting the concentration of
the inducing agent to which the regulated promoter is responsive. The desired level of
activity of the regulated promoter, as obtained by applying a particular concentration of the
inducing agent, can be determined based on the desired silencing level of the target gene.

In one embodiment, a tetracycline regulated gene expression system is used (see, e.g.,
Gossen et al., 1995, Science 268:1766-1769; U.S. Patent No. 6,004,941). A tet regulated
system utilizes components of the tet repressor/operator/inducer system of prokaryotes to
regulate gene expression in eukaryotic cells. Thus, the invention provides methods for using
the tet regulatory system for regulating the expression of an shRNA linked to one or more tet
operator sequences. The methods involve introducing into a cell a vector encoding a fusion
protein that activates transcription. The fusion protein comprises a first polypeptide that
binds to a tet operator sequence in the presence of tetracycline or a tetracycline analogue
operatively linked to a second polypeptide that activates transcription in cells. By modulating
the concentration of a tetracycline, or a tetracycline analogue, expression of the tet
operator-linked shRNA is regulated.

In other embodiments, an ecdysone regulated gene expression system (see, e.g., Saez et
al., 2000, Proc. Natl. Acad. Sci. USA 97:14512-14517), or an MMTV glucocorticoid
response element regulated gene expression system (see, e.g., Lucas et al., 1992, Annu. Rev.
Biochem. 61:1131) may be used to regulate the expression of the shRNA.

In one embodiment, the pRETRO-SUPER (pRS) vector, which encodes a puromycin-resistance
marker and drives shRNA expression from an H1 (RNA Pol III) promoter, is used.
The pRS-shRNA plasmid can be generated by any standard method known in the art. In one
embodiment, the pRS-shRNA is deconvolved from the library plasmid pool for a chosen
gene by transforming bacteria with the pool and looking for clones containing only the
plasmid of interest. Preferably, a 19mer siRNA sequence is used along with suitable forward
and reverse primers for sequence specific PCR. Plasmids are identified by sequence specific
PCR, and confirmed by sequencing. Cells expressing the shRNA are generated by
transfecting suitable cells with the pRS-shRNA plasmid. Cells are selected by the
appropriate marker, e.g., puromycin, and maintained until colonies are evident. Clones are
then picked and tested for knockdown. In another embodiment, an shRNA is expressed by a
plasmid, e.g., a pRS-shRNA. The knockdown by the pRS-shRNA plasmid can be achieved
by transfecting cells using Lipofectamine 2000 (Invitrogen).

In yet another method, siRNAs can be delivered to an organ or tissue in an animal,
such as a human, in vivo (see, e.g., Song et al. 2003, Nat. Medicine 9, 347-351; Sorensen et al.,
2003, J. Mol. Biol. 327, 761-766; Lewis et al., 2002, Nat. Genetics 32, 107-108, all of which
are incorporated by reference herein in their entirety). In this method, a solution of siRNA is
injected intravenously into the animal. The siRNA can then reach an organ or tissue of
interest and effectively reduce the expression of the target gene in the organ or tissue of the
animal.

The siRNAs can also be delivered to an organ or tissue using a gene therapy
approach. Any of the methods for gene therapy available in the art can be used to deliver the
siRNA. For general reviews of the methods of gene therapy, see Goldspiel et al., 1993,
Clinical Pharmacy 12:488-505; Wu and Wu, 1991, Biotherapy 3:87-95; Tolstoshev, 1993,
Ann. Rev. Pharmacol. Toxicol. 32:573-596; Mulligan, 1993, Science 260:926-932;
Morgan and Anderson, 1993, Ann. Rev. Biochem. 62:191-217; and May, 1993, TIBTECH
11(5):155-215. In a preferred embodiment, the therapeutic comprises a nucleic acid
encoding the siRNA as a part of an expression vector. In particular, such a nucleic acid has a
promoter operably linked to the siRNA coding region, the promoter being inducible
or constitutive, and, optionally, tissue-specific. In another particular embodiment, a nucleic
acid molecule in which the siRNA coding sequence is flanked by regions that promote
homologous recombination at a desired site in the genome is used (see, e.g., Koller and
Smithies, 1989, Proc. Natl. Acad. Sci. U.S.A. 86:8932-8935; Zijlstra et al., 1989, Nature
342:435-438).

In a specific embodiment, the nucleic acid is directly administered in vivo. This can
be accomplished by any of numerous methods known in the art, e.g., by constructing it as
part of an appropriate nucleic acid expression vector and administering it so that it becomes
intracellular, e.g., by infection using a defective or attenuated retroviral or other viral vector
(see U.S. Patent No. 4,980,286), or by direct injection of naked DNA, or by use of
microparticle bombardment (e.g., a gene gun; Biolistic, Dupont), or coating with lipids or
cell-surface receptors or transfecting agents, encapsulation in liposomes, microparticles, or
microcapsules, or by administering it in linkage to a peptide which is known to enter the
nucleus, by administering it in linkage to a ligand subject to receptor-mediated endocytosis
(see, e.g., Wu and Wu, 1987, J. Biol. Chem. 262:4429-4432) (which can be used to target cell
types specifically expressing the receptors), etc. In another embodiment, a nucleic
acid-ligand complex can be formed in which the ligand comprises a fusogenic viral peptide to
disrupt endosomes, allowing the nucleic acid to avoid lysosomal degradation. In yet another
embodiment, the nucleic acid can be targeted in vivo for cell specific uptake and expression,
by targeting a specific receptor (see, e.g., PCT Publications WO 92/06180 dated April 16,
1992 (Wu et al.); WO 92/22635 dated December 23, 1992 (Wilson et al.); WO 92/20316
dated November 26, 1992 (Findeis et al.); WO 93/14188 dated July 22, 1993 (Clarke et al.);
WO 93/20221 dated October 14, 1993 (Young)). Alternatively, the nucleic acid can be
introduced intracellularly and incorporated within host cell DNA for expression, by
homologous recombination (Koller and Smithies, 1989, Proc. Natl. Acad. Sci. U.S.A.
86:8932-8935; Zijlstra et al., 1989, Nature 342:435-438).

In a specific embodiment, a viral vector that contains the siRNA coding nucleic acid
is used. For example, a retroviral vector can be used (see Miller et al., 1993, Meth. Enzymol.
217:581-599). These retroviral vectors have been modified to delete retroviral sequences that
are not necessary for packaging of the viral genome and integration into host cell DNA. The
siRNA coding nucleic acid to be used in gene therapy is cloned into the vector, which
facilitates delivery of the gene into a patient. More detail about retroviral vectors can be
found in Boesen et al., 1994, Biotherapy 6:291-302, which describes the use of a retroviral
vector to deliver the mdr1 gene to hematopoietic stem cells in order to make the stem cells
more resistant to chemotherapy. Other references illustrating the use of retroviral vectors in
gene therapy are: Clowes et al., 1994, J. Clin. Invest. 93:644-651; Kiem et al., 1994, Blood
83:1467-1473; Salmons and Gunzberg, 1993, Human Gene Therapy 4:129-141; and
Grossman and Wilson, 1993, Curr. Opin. Genet. and Devel. 3:110-114.

Adenoviruses are other viral vectors that can be used in gene therapy. Adenoviruses
are especially attractive vehicles for delivering genes to respiratory epithelia. Adenoviruses
naturally infect respiratory epithelia where they cause a mild disease. Other targets for
adenovirus-based delivery systems are liver, the central nervous system, endothelial cells, and
muscle. Adenoviruses have the advantage of being capable of infecting non-dividing cells.
Kozarsky and Wilson (1993, Current Opinion in Genetics and Development 3:499-503)
present a review of adenovirus-based gene therapy. Bout et al. (1994, Human Gene Therapy
5:3-10) demonstrated the use of adenovirus vectors to transfer genes to the respiratory
epithelia of rhesus monkeys. Other instances of the use of adenoviruses in gene therapy can
be found in Rosenfeld et al., 1991, Science 252:431-434; Rosenfeld et al., 1992, Cell
68:143-155; and Mastrangeli et al., 1993, J. Clin. Invest. 91:225-234. Adeno-associated virus
(AAV) may also be used in gene therapy (Walsh et al., 1993, Proc. Soc. Exp. Biol. Med.
204:289-300).

Degree of silencing can be determined using any standard RNA or protein
quantification method known in the art. For example, RNA quantification can be performed
using Real-time PCR, e.g., using AP Biosystems TaqMan pre-developed assay reagent
(#4319442). Primer probe for the appropriate gene can be designed using any standard
method known in the art, e.g., using Primer Express software. RNA values can be normalized
to RNA for actin (#4326315). Protein levels can be quantified by flow cytometry following
staining with appropriate antibody and labeled secondary antibody. Protein levels can also be
quantified by western blot of cell lysates with appropriate monoclonal antibodies followed by
Kodak image analysis of chemiluminescent immunoblot. Protein levels can also be
normalized to actin levels.
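
The normalization described above can be sketched in a few lines of Python. This is only an illustration: the function name, its arguments, and the use of an untreated (e.g., mock-transfected) control sample as the 100% expression baseline are assumptions that the text does not specify.

```python
def percent_silencing(target_treated, actin_treated, target_control, actin_control):
    """Percent silencing from actin-normalized RNA levels.

    Hypothetical helper: the control sample is assumed to represent
    unsilenced (100%) expression of the target gene.
    """
    if min(actin_treated, actin_control, target_control) <= 0:
        raise ValueError("actin and control target levels must be positive")
    # Normalize each target measurement to the actin level of the same sample.
    normalized_treated = target_treated / actin_treated
    normalized_control = target_control / actin_control
    # Silencing is the fractional loss of normalized expression.
    return 100.0 * (1.0 - normalized_treated / normalized_control)
```

For instance, a treated sample whose actin-normalized target level is one quarter of the control level corresponds to 75% silencing.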

Effects of gene silencing on a cell can be evaluated by any known assay. For
example, cell growth can be assayed using any suitable proliferation or growth inhibition
assays known in the art. In a preferred embodiment, an MTT proliferation assay (see, e.g.,
van de Loosdrecht et al., 1994, J. Immunol. Methods 174:311-320; Ohno et al., 1991, J.
Immunol. Methods 145:199-203; Ferrari et al., 1990, J. Immunol. Methods 131:165-172;
Alley et al., 1988, Cancer Res. 48:589-601; Carmichael et al., 1987, Cancer Res. 47:936-942;
Gerlier et al., 1986, J. Immunol. Methods 65:55-63; Mosmann, 1983, J. Immunological
Methods 65:55-63) is used to assay the effect of one or more agents in inhibiting the growth
of cells. The cells are treated with chosen concentrations of one or more candidate agents for
a chosen period of time, e.g., for 4 to 72 hours. The cells are then incubated with a suitable
amount of 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) for a chosen
period of time, e.g., 1-8 hours, such that viable cells convert MTT into an intracellular deposit
of insoluble formazan. After removing the excess MTT contained in the supernatant, a
suitable MTT solvent, e.g., a DMSO solution, is added to dissolve the formazan. The
concentration of MTT, which is proportional to the number of viable cells, is then measured
by determining the optical density at, e.g., 570 nm. A plurality of different concentrations of
the candidate agent can be assayed to allow the determination of the concentrations of the
candidate agent or agents which cause 50% inhibition.
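
One simple way to estimate the concentration causing 50% inhibition from a series of measured concentrations is sketched below. The text does not prescribe an estimation method; log-linear interpolation between the two bracketing concentrations, the function names, and the use of an untreated well as the 0% inhibition reference are all assumptions made for illustration.

```python
import math

def percent_inhibition(od_treated, od_untreated):
    """Growth inhibition (%) of a treated well relative to an untreated well."""
    return 100.0 * (1.0 - od_treated / od_untreated)

def ic50_by_interpolation(concentrations, inhibitions):
    """Estimate the concentration giving 50% inhibition.

    Interpolates linearly in log10(concentration) between the two adjacent
    measurements that bracket 50%. Returns None if 50% is never bracketed.
    """
    points = sorted(zip(concentrations, inhibitions))
    for (c_lo, inh_lo), (c_hi, inh_hi) in zip(points, points[1:]):
        if inh_lo <= 50.0 <= inh_hi and inh_hi > inh_lo:
            fraction = (50.0 - inh_lo) / (inh_hi - inh_lo)
            log_c = math.log10(c_lo) + fraction * (math.log10(c_hi) - math.log10(c_lo))
            return 10.0 ** log_c
    return None
```

A fitted dose-response model would normally be preferred when enough concentrations are available; interpolation is shown only because it needs no additional assumptions about curve shape.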

In another preferred embodiment, an alamarBlue™ assay for cell proliferation is used
to screen for one or more candidate agents that can be used to inhibit the growth of cells (see,
e.g., Page et al., 1993, Int. J. Oncol. 3:473-476). An alamarBlue™ assay measures cellular
respiration and uses it as a measure of the number of living cells. The internal environment
of proliferating cells is more reduced than that of non-proliferating cells. For example, the
ratios of NADPH/NADP, FADH/FAD, FMNH/FMN, and NADH/NAD increase during
proliferation. AlamarBlue can be reduced by these metabolic intermediates and, therefore,
can be used to monitor cell proliferation. The cell number of a treated sample as measured by
alamarBlue can be expressed in percent relative to that of an untreated control sample.
AlamarBlue reduction can be measured by either absorption or fluorescence spectroscopy. In
one embodiment, the alamarBlue reduction is determined by absorbance and calculated as
percent reduced using the equation:

$$\%\ \text{Reduced} = \frac{(\varepsilon_{ox}\lambda_2)(A\lambda_1) - (\varepsilon_{ox}\lambda_1)(A\lambda_2)}{(\varepsilon_{red}\lambda_1)(A'\lambda_2) - (\varepsilon_{red}\lambda_2)(A'\lambda_1)} \times 100$$

where:

$\lambda_1$ = 570 nm

$\lambda_2$ = 600 nm

$(\varepsilon_{red}\lambda_1)$ = 155,677 (Molar extinction coefficient of reduced alamarBlue at 570 nm)
$(\varepsilon_{red}\lambda_2)$ = 14,652 (Molar extinction coefficient of reduced alamarBlue at 600 nm)
$(\varepsilon_{ox}\lambda_1)$ = 80,586 (Molar extinction coefficient of oxidized alamarBlue at 570 nm)
$(\varepsilon_{ox}\lambda_2)$ = 117,216 (Molar extinction coefficient of oxidized alamarBlue at 600 nm)

$(A\lambda_1)$ = Absorbance of test wells at 570 nm
$(A\lambda_2)$ = Absorbance of test wells at 600 nm

$(A'\lambda_1)$ = Absorbance at 570 nm of negative control wells, which contain medium plus
alamarBlue but to which no cells have been added.

$(A'\lambda_2)$ = Absorbance at 600 nm of negative control wells, which contain medium plus
alamarBlue but to which no cells have been added.

Preferably, the % Reduced of wells containing no cells is subtracted from the % Reduced of
wells containing samples to determine the % Reduced above background.
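
The percent-reduced calculation can be sketched as follows. The extinction coefficients are those listed above; the arrangement of terms follows the commonly used absorbance form of the alamarBlue calculation, which is assumed here, and the function and variable names are illustrative only.

```python
# Molar extinction coefficients listed in the text.
EPS_OX_570 = 80586.0    # oxidized alamarBlue at 570 nm
EPS_OX_600 = 117216.0   # oxidized alamarBlue at 600 nm
EPS_RED_570 = 155677.0  # reduced alamarBlue at 570 nm
EPS_RED_600 = 14652.0   # reduced alamarBlue at 600 nm

def percent_reduced(a570, a600, blank570, blank600):
    """Percent reduced alamarBlue from test-well and no-cell control absorbances.

    Assumed form:
      100 * (eps_ox(600)*A(570) - eps_ox(570)*A(600))
          / (eps_red(570)*A'(600) - eps_red(600)*A'(570))
    where A is the test well and A' is the negative control (no cells).
    """
    numerator = EPS_OX_600 * a570 - EPS_OX_570 * a600
    denominator = EPS_RED_570 * blank600 - EPS_RED_600 * blank570
    return 100.0 * numerator / denominator

def percent_reduced_above_background(sample, no_cell, blank):
    """Subtract the % Reduced of a no-cell well from that of a sample well.

    Each argument is a (A570, A600) pair; `blank` is the negative control.
    """
    return (percent_reduced(*sample, *blank)
            - percent_reduced(*no_cell, *blank))
```

As a sanity check on the form of the equation, a test well whose absorbances equal those of the fully oxidized dye yields approximately 0% reduced, and a well containing the same amount of fully reduced dye yields approximately 100%.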

Cell cycle analysis can be carried out using standard methods known in the art. In one
embodiment, the supernatant from each well is combined with the cells that have been
harvested by trypsinization. The mixture is then centrifuged at a suitable speed. The cells are
then fixed with, e.g., ice cold 70% ethanol for a suitable period of time, e.g., ~30 minutes.
Fixed cells can be washed once with PBS and resuspended, e.g., in 0.5 ml of PBS containing
Propidium Iodide (10 microgram/ml) and RNase A (1 mg/ml), and incubated at a suitable
temperature, e.g., 37°C, for a suitable period of time, e.g., 30 min. Flow cytometric analysis
is then carried out using a flow cytometer. In one embodiment, the Sub-G1 cell population
is used as a measure of cell death. For example, the cells are said to have been sensitized to
an agent if the Sub-G1 population from the sample treated with the agent is larger than the
Sub-G1 population of the sample not treated with the agent.

5.6. IMPLEMENTATION SYSTEMS AND METHODS 
The analytical methods of the present invention can preferably be implemented using
a computer system, such as the computer system described in this section, according to the
following programs and methods. Such a computer system can also preferably store and
manipulate measured signals obtained in various experiments that can be used by a computer
system implemented with the analytical methods of this invention. Accordingly, such
computer systems are also considered part of the present invention.

An exemplary computer system suitable for implementing the analytic methods of
this invention is illustrated in FIG. 12. Computer system 1201 is illustrated here as
comprising internal components and as being linked to external components. The internal
components of this computer system include one or more processor elements 1202
interconnected with a main memory 1203. For example, computer system 1201 can be an
Intel Pentium IV®-based processor of 2 GHz or greater clock rate and with 256 MB or more
main memory. In a preferred embodiment, computer system 1201 is a cluster of a plurality
of computers comprising a head "node" and eight sibling "nodes," with each node having a
central processing unit ("CPU"). In addition, the cluster also comprises at least 128 MB of
random access memory ("RAM") on the head node and at least 256 MB of RAM on each of
the eight sibling nodes. Therefore, the computer systems of the present invention are not
limited to those consisting of a single memory unit or a single processor unit.

The external components can include a mass storage 1204. This mass storage can be
one or more hard disks that are typically packaged together with the processor and memory.
Such hard disks are typically of 10 GB or greater storage capacity and more preferably have at
least 40 GB of storage capacity. For example, in a preferred embodiment, described above,
wherein a computer system of the invention comprises several nodes, each node can have its
own hard drive. The head node preferably has a hard drive with at least 10 GB of storage
capacity whereas each sibling node preferably has a hard drive with at least 40 GB of storage
capacity. A computer system of the invention can further comprise other mass storage units
including, for example, one or more floppy drives, one or more CD-ROM drives, one or more
DVD drives or one or more DAT drives.

Other external components typically include a user interface device 1205, which is
most typically a monitor and a keyboard together with a graphical input device 1206 such as
a "mouse." The computer system is also typically linked to a network link 1207 which can
be, e.g., part of a local area network ("LAN") to other, local computer systems and/or part of
a wide area network ("WAN"), such as the Internet, that is connected to other, remote
computer systems. For example, in the preferred embodiment, discussed above, wherein the
computer system comprises a plurality of nodes, each node is preferably connected to a
network, preferably an NFS network, so that the nodes of the computer system communicate
with each other and, optionally, with other computer systems by means of the network and
can thereby share data and processing tasks with one another.

Loaded into memory during operation of such a computer system are several software
components that are also shown schematically in FIG. 12. The software components
comprise both software components that are standard in the art and components that are
special to the present invention. These software components are typically stored on mass
storage such as the hard drive 1204, but can be stored on other computer readable media as
well including, for example, one or more floppy disks, one or more CD-ROMs, one or more
DVDs or one or more DATs. Software component 1210 represents an operating system
which is responsible for managing the computer system and its network interconnections.
The operating system can be, for example, of the Microsoft Windows™ family such as
Windows 95, Windows 98, Windows NT, Windows 2000 or Windows XP. Alternatively, the
operating software can be a Macintosh operating system, a UNIX operating system or a
LINUX operating system. Software component 1211 comprises common languages and
functions that are preferably present in the system to assist programs implementing methods
specific to the present invention. Languages that can be used to program the analytic
methods of the invention include, for example, C and C++, FORTRAN, PERL, HTML,
JAVA, and any of the UNIX or LINUX shell command languages such as C shell script
language. The methods of the invention can also be programmed or modeled in
mathematical software packages that allow symbolic entry of equations and high-level
specification of processing, including specific algorithms to be used, thereby freeing a user of
the need to procedurally program individual equations and algorithms. Such packages
include, e.g., Matlab from Mathworks (Natick, MA), Mathematica from Wolfram Research
(Champaign, IL) or S-Plus from MathSoft (Seattle, WA).

Software component 1212 comprises any analytic methods of the present invention
described supra, preferably programmed in a procedural language or symbolic package. For
example, software component 1212 preferably includes programs that cause the processor to
implement steps of accepting a plurality of measured signals and storing the measured signals
in the memory. For example, the computer system can accept measured signals that are
manually entered by a user (e.g., by means of the user interface). More preferably, however,
the programs cause the computer system to retrieve measured signals from a database. Such
a database can be stored on a mass storage (e.g., a hard drive) or other computer readable
medium and loaded into the memory of the computer, or the compendium can be accessed by
the computer system by means of the network 1207.

In addition to the exemplary program structures and computer systems described
herein, other, alternative program structures and computer systems will be readily apparent to
the skilled artisan. Such alternative systems, which do not depart from the above described
computer system and program structures either in spirit or in scope, are therefore intended to
be comprehended within the accompanying claims.

6. EXAMPLES
The following examples are presented by way of illustration of the present invention,
and are not intended to limit the present invention in any way.

6.1. EXAMPLE 1: DESIGNING siRNA FOR HIGH SILENCING EFFICACY

A library of siRNAs targeting more than 700 genes was constructed. The siRNAs in
the library were designed by use of a "standard" approach, based on a combination of limited
design principles available from the scientific literature (Elbashir et al., 2001, Nature
411:494-8) and a method for predicting off-target effects by sequence similarity scoring as
described in Section 5.2. A set of 377 siRNAs was tested by Taqman analysis for their
ability to silence their respective target genes. The set of 377 siRNAs is listed in Table II.
Table II lists the following information for the 377 siRNAs: the ID number of the siRNA, the
accession number of the target gene, start position of the target sequence, target sequence, %
silencing, the set to which it belongs (i.e., training or test) in Set 1, the set to which it belongs
in Set 2, and the SEQ ID NO. The results of this test showed that most siRNAs successfully
silenced their target genes (median silencing, ~75%), but individual siRNAs still showed a
wide range of silencing performance. Good (or poor) silencing ability was not consistently
associated with any particular base at any position, overall GC content, the position of the
siRNA sequence within the target transcript, or with alternative splicing of target transcripts.

The potential relationship between target gene silencing and the base-composition,
thermodynamics and secondary structure of the siRNA and target sequences was explored
using a classifier approach. siRNAs were divided into groups containing those with less than
median silencing ability ("bad" siRNAs) and those with median or better silencing ability
("good" siRNAs). A number of metrics were evaluated for their ability to distinguish good
and bad siRNAs, including base composition in windows of the 19mer siRNA duplex
sequence and the flanking target region, secondary structure predictions by various programs
and thermodynamic properties. These tests revealed that siRNA efficacy correlated well with
siRNA and target gene base composition, but poorly with secondary structure predictions and
thermodynamic properties. In particular, the GC content of good siRNAs differed
substantially from that of bad siRNAs in a position-specific manner (FIGS. 1-3). For
example, good siRNA duplexes were not observed to be associated with any particular
sequence, but tended to be GC rich at the 5' end and GC poor at the 3' end. The data indicate
that a good siRNA duplex encourages preferential interaction of the antisense strand by
being GC poor at its 3' end and discourages interaction of the sense strand by being GC rich
at its 5' end. The data further demonstrate that position-specific sequence preferences extend
beyond the boundaries of the siRNA target sequence into the adjacent sequence(s). This
suggests that steps during RNA silencing other than unwinding of the siRNA duplex are
affected by position-specific base composition preferences.
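
The good/bad comparison described above can be illustrated with a short sketch that computes, at each position, the difference in GC frequency between siRNAs with median-or-better silencing and those with below-median silencing. The function names and the toy input format (equal-length sequences with one silencing value each) are assumptions for illustration; this is not the analysis code used to produce the figures.

```python
from statistics import median

def gc_fraction_by_position(sequences):
    """Fraction of sequences with G or C at each position."""
    length = len(sequences[0])
    return [sum(seq[i] in "GC" for seq in sequences) / len(sequences)
            for i in range(length)]

def gc_difference_profile(sequences, silencing):
    """GC frequency of "good" minus "bad" siRNAs at each position.

    "Good" means silencing at or above the median; "bad" means below it,
    following the grouping described in the text.
    """
    cutoff = median(silencing)
    good = [s for s, v in zip(sequences, silencing) if v >= cutoff]
    bad = [s for s, v in zip(sequences, silencing) if v < cutoff]
    if not good or not bad:
        raise ValueError("both groups must be non-empty")
    good_gc = gc_fraction_by_position(good)
    bad_gc = gc_fraction_by_position(bad)
    return [g - b for g, b in zip(good_gc, bad_gc)]
```

Positive values in the returned profile mark positions where good siRNAs are enriched for G or C; negative values mark positions where they are depleted.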

The GC-content difference between good and bad siRNAs shown in FIGS. 1 and 2
was used to develop methods for selecting good siRNAs. Best results were obtained with a
position-specific scoring matrix (PSSM) approach. The PSSM provides weights for GC, A
or U at every position on the sense strand of the target gene sequence from 10 bases upstream
of the start to 10 bases downstream of the end of the siRNA duplex. The siRNA efficacy
data were divided into two sets, one to be used for training and the other for an independent
test. A random-mutation hill-climbing search algorithm was used to optimize the weights for
each base at each position of the PSSM simultaneously. The optimization criterion was the
correlation coefficient between the target silencing of the siRNA and its PSSM score.
Multiple runs of optimization on the training data set were averaged to complete each PSSM.
Each PSSM was then tested on the independent (test) set of siRNAs. The performance of
two PSSMs on their training and test data sets is demonstrated in Figure 2.
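
A random-mutation hill-climbing search of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used here: the scoring form (a sum of ln(weight / 0.25) over positions), the starting weights, the Gaussian mutation step, the clamping range, and all function names are choices made for the sketch. Only the acceptance criterion, the correlation between silencing and score, comes from the text.

```python
import math
import random

def base_class(base):
    """Map a base to a weight column: 0 = G or C, 1 = A, 2 = U (or T)."""
    if base in "GC":
        return 0
    return 1 if base == "A" else 2

def score(weights, seq, p=0.25):
    """Sum of ln(weight / p) over all positions of seq (assumed score form)."""
    return sum(math.log(weights[i][base_class(b)] / p) for i, b in enumerate(seq))

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has (near) zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx < 1e-12 or syy < 1e-12:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def rmhc(seqs, silencing, steps=2000, step_sd=0.05, seed=0):
    """Mutate one randomly chosen weight per step; keep it if correlation does not drop."""
    rng = random.Random(seed)
    weights = [[0.25, 0.25, 0.25] for _ in range(len(seqs[0]))]
    best = pearson([score(weights, s) for s in seqs], silencing)
    for _ in range(steps):
        i, j = rng.randrange(len(weights)), rng.randrange(3)
        old = weights[i][j]
        # Clamp so that the logarithm in score() stays defined.
        weights[i][j] = min(1.0, max(0.01, old + rng.gauss(0.0, step_sd)))
        r = pearson([score(weights, s) for s in seqs], silencing)
        if r >= best:
            best = r
        else:
            weights[i][j] = old  # revert a harmful mutation
    return weights, best
```

Because each run follows a different random path, several runs started from different seeds can be averaged position by position, in the spirit of the averaging of multiple optimization runs mentioned above.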

An siRNA design method was developed based on a position-specific score matrix
(PSSM). A scoring scheme is used to predict the efficacy of siRNA oligos. The score is a
weighted sum of 39 bases (10 bases upstream of the 19mer, 19 bases on the siRNA proper,
and 10 bases downstream) computed as follows:

$$\text{Score} = \sum_{i=1}^{39} \ln\left(E_i / p_i\right)$$

where $p_i$ equals the random probability of any base, i.e., 0.25, and $E_i$ is the weight assigned to
the base A, U, G or C at position $i$. Therefore, a total of 117 weights (39 positions times 3
base types: G or C, A, U) need to be assigned and optimized.
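
The scoring scheme can be made concrete with a small sketch. It assumes the score is the sum over the 39 positions of ln(E_i / p_i) with p_i = 0.25, and that the weights are stored as a 39 x 3 table whose columns correspond to G or C, A, and U; the table layout and function names are illustrative assumptions.

```python
import math

RANDOM_BASE_PROBABILITY = 0.25  # p_i in the text
WINDOW_LENGTH = 39              # 10 upstream + 19mer + 10 downstream

def base_class(base):
    """Column index of a base: 0 = G or C, 1 = A, 2 = U (T is treated as U)."""
    if base in "GC":
        return 0
    if base == "A":
        return 1
    if base in "UT":
        return 2
    raise ValueError(f"unexpected base: {base!r}")

def pssm_score(weights, window):
    """Sum of ln(E_i / p_i) over a 39-base window of the target sequence."""
    if len(window) != WINDOW_LENGTH or len(weights) != WINDOW_LENGTH:
        raise ValueError("window and weight table must both cover 39 positions")
    return sum(
        math.log(weights[i][base_class(b)] / RANDOM_BASE_PROBABILITY)
        for i, b in enumerate(window)
    )
```

With every weight set to 0.25 the score of any window is zero, so positive scores reflect positions where the observed base carries a weight above the random expectation.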

A random-mutation hill climbing (RMHC) search algorithm was utilized to optimize
the weights based on a training oligo set and the resulting profiles applied to a test set, with
the optimizing criteria being the correlation coefficient between the knock-down (KD) levels
of the oligos and the computed PSSM scores. The metric to measure the effectiveness of the
training and testing is the aggregate false detection rate (FDR) based on the ROC curve, and
is computed as the average of the FDR scores of the top 33% oligos sorted by the scores
given by the trained predictor. In computing the FDR scores, those oligos with silencing
levels less than the median are considered false, and those more than the median silencing
levels considered true.
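
The aggregate FDR metric can be sketched as below. The text does not give an explicit formula, so the reading used here is an assumption: at each rank in the top 33% of oligos (sorted by predictor score), the FDR is taken as the fraction of all "false" oligos detected so far, as on the false-positive axis of an ROC curve, and these values are averaged. Under this reading a random ordering gives an expected value of roughly 0.17, which matches the random baseline reported in this example. Oligos exactly at the median are treated as true, which is also an assumption.

```python
import math
from statistics import median

def aggregate_fdr(predictor_scores, silencing, top_fraction=0.33):
    """Average false detection rate over the top-scoring fraction of oligos."""
    cutoff = median(silencing)
    is_false = [s < cutoff for s in silencing]  # below-median silencing = false
    total_false = sum(is_false)
    if total_false == 0:
        raise ValueError("no below-median oligos; FDR is undefined")
    # Rank oligos from highest to lowest predictor score.
    order = sorted(range(len(predictor_scores)),
                   key=lambda i: predictor_scores[i], reverse=True)
    top_n = max(1, math.ceil(top_fraction * len(order)))
    false_seen = 0
    rates = []
    for rank in range(top_n):
        false_seen += is_false[order[rank]]
        rates.append(false_seen / total_false)
    return sum(rates) / len(rates)
```

A predictor that ranks only true oligos in its top third scores 0.0, and lower values indicate better enrichment of effective siRNAs among the top-ranked candidates.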

Different criteria were used to divide the existing siRNA performance data into
training and test sets. The greatest obstacle to an ideal partition is that the vast majority of
siRNA oligos were designed with the standard method, which requires an AA dimer
immediately before the 19mer oligo sequence. This limitation was found later to be
detrimental rather than helpful to the design process and was abolished. To limit the
influence of this bias on the training procedure, several partitions were used, and multiple
trained predictors, i.e., PSSMs, rather than a single predictor, were combined to assign scores
to the test oligos.

Finally, a state-of-the-art siRNA oligo design procedure (also referred to as a
"pipeline") was constructed. It incorporates the off-target prediction procedure and two
ensembles of siRNA oligo efficacy predictors trained and tested on different data sets. A
total of 30 siRNA oligos (6 oligos for each of 5 genes) were selected and tested. The results
were significantly better than those of any of the previously existing pipelines.

The initial training and testing results showed that the PSSM is very effective in
predicting the on-target efficacy of siRNA oligos. Typically the aggregate FDR scores for
training are between 0.02 and 0.08, and those for testing between 0.05 and 0.10. As a
reference, random predictions have a mean aggregate FDR of 0.17, with the standard
deviation being 0.02 (data computed with 10,000 randomly generated predictions). FIG. 3
illustrates typical ROC curves, generated by an ensemble of about 200 randomly optimized
predictors. It can be seen that performance on the training set is better than on the test set,
which is hardly surprising. Both curves are significantly better than random.

FIG. 5 illustrates the resulting sequence profiles from training and testing on several
different oligo sets. This profile illustrates that G or C bases are strongly preferred at the
beginning, i.e., the 5' end, and strongly disfavored at the end, i.e., the 3' end, of the 19mer
sequence. To confirm this observation, the average knock-down levels for oligos starting
and ending with G/C or A/U were computed; those oligos starting with G/C and ending
with A/U have the best performance, far superior to the other three categories. Simply by
comparing the weights at different positions, a 19mer oligo having a sequence of
GCGTTAATGTGATAATATA (SEQ ID NO: 1) and the oligos that are most similar to this
sequence are identified as siRNAs that may have high silencing efficacy.

The design method incorporated both PSSMs shown in FIG. 3 because the
combination gave better performance than either PSSM alone. The
improved siRNA design method selected oligonucleotides based on 4 principles: base
composition, off-target identity, position in the transcript, and sequence variety. Certain
oligonucleotides containing sequence from features such as untranslated regions, repeats or
homopolymeric runs were eliminated. Remaining oligonucleotides were ranked by their
PSSM scores. Top-ranking oligonucleotides were selected for variety in GC content, in start
position, and in the two bases upstream of the siRNA 19mer duplex. Selected
oligonucleotides were then filtered for predicted off-target activity, which was calculated as a
position-weighted FASTA alignment score. Remaining oligonucleotides were ranked by
PSSM scores, subjected to a second round of selection for variety and finally re-ranked by
their PSSM scores. The desired number of siRNAs was retained from the top of this final
ranking.

The improved method was compared to the standard method by side-by-side testing
of new siRNAs selected by each. The results obtained with three siRNAs selected by each
method are shown in Figure 3. siRNAs designed by the improved algorithm showed better
median efficacy (88%, compared to 78% for the standard method siRNAs) and were more
uniform in their performance. The distribution of silencing efficacies of the improved
algorithm siRNAs was significantly better than that of the standard method siRNAs for the
same genes (p = 0.004, Wilcoxon rank sum test).

The test of 30 experimental oligos using the new pipeline proved
successful. Table III lists the 30 siRNAs. In the past, an siRNA designed with the standard
method had a median silencing level of 75%. Of the 30 experimental oligos, 28 had silencing
levels equal to or better than 75%, 26 better than or equal to 80%, and 37% better than 90%,
compared with only 10% better than 90% using the standard method. Two target genes
(KIF14 and IGF1R) had been very difficult to silence by siRNAs, with previous oligos
achieving only 40% to 70% and no more than 80% silencing levels in the past. The 12 new
oligos targeting these genes all achieved silencing of at least 80% and 6 achieved 90% levels.
The two oligos among the 30 which had less than 75% silencing turned out to be
targeting an exon that is unique to one target transcript sequence, but absent in all other
alternative splice forms of the same gene. Therefore, the failure of these two oligos was due
to improper input sequence rather than the PSSM method. Thus, when given proper
input sequences, the pipeline appears to be able to pick oligos that can knock down target
genes by at least 75% for 100% of the target genes.



Table II A library of 377 siRNAs

BioID   accession number   start position   19mer sequence   % silencing Set 1   Set 2   SEQ ID NO



31    NM_000075   437    TGTTGTCCGGCTGATCGAC
36    NM_001813   1036   ACTCTTACTGCTCTCCAGT
37    NM_001813   1278   CTTAACACGGATGCTGOTG
38    NM_001813   3427   GGAGAGCTTTCTAGGACCT
39    NM_004073   192    AGTCATCCCGCAGAGCCGC
40    NM_004073   1745   ATCGTAGTGCTTGTACTTA
41    NM_004073   717    GGAGACGTACCGCTGCATC
42    AK092024    437    GCAGTGATTGCTCAGCAGC
43    NM_030932   935    GAGTTTACCGACCACCAAG
44    NM_030932   1186   TGCGGATGCCATTCAGTGG
45    NM_030932   1620   CACGGTTGGCAGAGTCTAT
49    U53530      169    GCAAGTTGAGCTCTACCGC
50    U53530      190    TGGCCAGCGCTTACTGGAA
64    NM_006101   1623   GTTCAAAAGCTGGATGATC
65    NM_006101   186    GGCCTCTATACCCCTCAAA
66    NM_006101   968    AGAACCGAATCGTCTAGAG
67    NM_000859   253    CACGATGCATAGCCATCCT
68    NM_000859   1075   CAGAGACAGAATCTACACT
69    NM_000859   1720   CAACAGAAGGTTGTCTTGT
70    NM_000859   2572   TTGTGTGTGGGACCGTAAT
71    NM_000875   276    GCTCACGGTCATTACCGAG
72    NM_000875   441    CCTGAGGAACATTACTCGG
73    NM_000875   483    TGCTGACCTCTGTTACCTC
74    NM_000875   777    CGACACGGCCTGTGTAGCT
75    NM_000875   987    CGGCAGCCAGAGCATGTAC
76    NM_000875   1320   CCAGAACTTGCAGCAACTG
81    NM_000875   351    CCTCACGGTCATCCGCGGC
83    NM_000875   387    CTACGCCCTGGTCATCTTC
84    NM_000875   417    TCTCAAGGATATTGGGCTT
85    NM_000875   423    GGATATTGGGCTTTACAAC
86    NM_000875   450    CATTACTCGGGGGGCCATC
87    NM_000875   481    AATGCTGACCTCTGTTACC
117   NM_004523   1689   CTGGATCGTAAGAAGGCAG
118   NM_004523   484    TGGAAGGTGAAAGGTCACC
119   NM_004523   802    GGACAACTGCAGCTACTCT
139   NM_002358   219    TACGGACTCACCTTGCTTG
144   NM_001315   779    GTATATACATTCAGCTGAC
145   NM_001315   1080   GGAACACCCCCCGCTTATC
146   NM_001315   1317   GTGGCCGATCCTTATGATC
152   NM_001315   607    ATGTGATTGGTCTGTTGGA
153   NM_001315   1395   GTCATCAGCTTTGTGCCAC
154   NM_001315   799    TAATTCACAGGGACCTAAA
155   NM_001315   1277   TGCCTACTTTGCTCAGTAC
Training


Test 




A 

77.9 


Training 


Test 




93.0 


Training 


lest 


J J 1 


OA O 

84.8 


Training 


Test 




85.4 


Training 


Test 




0.0 


Training 


Test 




0.0 


Training 


Test 


J ■J J 


81.9 


Training 


Test 




14.1 


Training 


Test 




rv rv 

U.U 


Training 


Test 


•J JO 


86.0 


Training 


Test 


Djy 


90.0 


Training 


Test 


rzAA 


87.0 


Training 


Test 




64.5 


Training 


Test 




0.0 


Training 


Test 




9.3 


Training 


Test 


1AA 


23.5 


Training 


Test 




72.1 


Training 


Test 


0*+D 


85.7 


Training 


Test 


J'r / 


80.1 


Training 


Test 


'^4X 


100.0 


Training 


Test 




89.3 


Training 


Test 




86.2 


Training 


Test 


j*j t 


8o.7 


Training 


'Pact* 

lesi 


352 


QA 1 

o4,J 


Training 


Test 




0£ o 


Training 


lest 


JJ*T 


90.8 


Training 


Test 






Training 


Test 




31.4 


Training 


Test 


^^57 


o2.9 


Training 


Test 


J JO 


0.0 


Training 


Test 




84.9 


Training 


Test 


JOU 


20.3 


Training 


Test 


301 


74.2 


Training 


Test 




79.7 


Training 


Test 




34.6 


Training 


Test 


364 


91.8 






365 


71.2 






366 


0.0 






367 


85.9 






368 



85 
At 1 1 A^TATTrAf^ArTGrA


85.7 


jWO 








68.1 








TAf^A^^^AATcrArArrrA^^ 


73.6 


t(\AQ 






/TATTfiATTAnATf It! 1 GA 

\3M 1 1 S3M 1 1 MUM 1 \« 1 «m 1 1 VIM 


91.3 


DWy 




3vJ*r 


/TTCACTATTATfT'^ACTTC 


415 


3050   NM_014875   2959   ATCTGGGGTGCTGATTGCT   46.3
3051   NM_014875   1514   GTGACAGTGGCAGTACGCG   67.7
3052   NM_014875   1114   TCAGACTGAAGTTGTTAGA   80.8
3053   NM_014875   2079   GTTGGCTAGAATTGGGAAA   91.8
3054   NM_014875   3560   GAAGACCATAGCATCCGCC   74.8
370
371 
372 
373 
374 
375 
376 
377 
378 



Table III 30 siRNAs designed using the method of this example

BioID         Accession     Gene name     Sequence (sense strand)        % Silencing   SEQ ID
[illegible]   [illegible]   [illegible]   r AHOT A A A GTPAGAG AC AT     87            379
[illegible]   [illegible]   [illegible]   nOfJATTnArnfiT^AGTAAfiA        89            380
[illegible]   NM_014875     KIF14         r APTO A ATGTGGG AGGTGA        92            381
[illegible]   [illegible]   [illegible]   GTCTGGGTGGAAATTCAAA            93            382
3848          NM_014875     KIF14         CATCTTTGCTGAATCGAAA            86            383
[illegible]   NM_014875     KIF14         CAGGGATGCTGTTTGGATA            95            384
[illegible]   [illegible]   [illegible]   [illegible]                    87            385
[illegible]   NM_005030     PLK           OrtTnTTPfifYififiP A AG ATT    86            386
[illegible]   NM_005030     PLK           C!GCCTCATCCTCTACAATG           88            387
[illegible]   NM_005030     PLK           GTTCTTTACTTCTGGCTAT            97            388
3854          NM_005030     PLK           CTCCTTAAATATTTCCGCA            92            389
3855          NM_005030     PLK           CTGAGCCTGAGGCCCGATA            75            390
3856          NM_000875     IGF1R         CAAATTATGTGTTTCCGAA            90            391
3857          NM_000875     IGF1R         CGCATGTGCTGGCAGTATA            84            392
3858          NM_000875     IGF1R         CXZGAAGAnrCACAGTCAA            79            393
3859          NM_000875     IGF1R         ACCATTGATTCTGTTACTT            86            394
3860          NM_000875     IGF1R         ACCGCAAAGTCTTTGAGAA            88            395
3861          NM_000875     IGF1R         GTCCTGACATGCTGTTTGA            79            396
3862          NM_001315     MAPK14        GGAATTCAATGATGTGTAT            85            397
3863          NM_001315     MAPK14        GCTGTTGACTGGAAGAACA            84            398
3864          NM_001315     MAPK14        CTCCTGAGATCATGCTGAA            81            399
3865          NM_001315     MAPK14        CCATTTCAGTCCATCATTC            88            400
3866          NM_001315     MAPK14        CAGATTATGCGTCTGACAG            25            401
3867          NM_001315     MAPK14        CGCTTATCTCATTAACAGG            14            402
3871          NM_004523     KIF11         GAGCCCAGATCAACCTTTA            87            403
3872          NM_004523     KIF11         CTGACAAGAGCTCAAGGAA            89            404
3873          NM_004523     KIF11         GGCATTAACACACTGGAGA            92            405
3874          NM_004523     KIF11         GATGGCAGCTCAAAGCAAA            93            406
3875          NM_004523     KIF11         CAGCAGAAATCTAAGGATA            86            407
3876          NM_004523     KIF11         OGTTCTGGAGCTGTTGATA            95            408




6.2. EXAMPLE 2: SELECTION OF SIRNAS FOR SILENCING SPECIFICITY

The importance of off-target effects of siRNA and shRNA sequences has been
shown. Microarray experiments suggest that most siRNA oligos result in downregulation of
off-target genes through direct interactions between dsRNA and the off-target transcripts.
While sequence similarity between dsRNA and transcripts appears to play a role in
determining which off-target genes will be affected, sequence similarity searches, even
combined with thermodynamic models of hybridization, are insufficient to predict off-target
effects accurately. However, alignment of off-target transcripts with offending siRNA
sequences reveals that some base pairing interactions between the two appear to be more
important than others (Fig. 6).

Figure 6 shows an example of alignments of transcripts of off-target genes to the core
19mer of an siRNA oligo sequence. Off-target genes were selected from the Human 25k
v2.2.1 microarray by selecting for kinetic patterns of transcript abundance consistent with
direct effects of siRNA oligos. Alignments were generated with FASTA and edited by hand.
The black boxes and grey area demonstrate the higher level of sequence similarity in the 3'
half of the alignment.

The alignment shown in Fig. 6 and similar data for other siRNAs were combined to
generate a position-specific scoring matrix for use in predicting off-target effects. The
matrix, which reflects the frequency with which each position in the oligo is found to match
affected off-target transcripts, is represented in Fig. 7.

The position-specific scoring matrix is used to calculate scores for alignments
between a candidate RNAi sequence and off-target transcript sequences. Alignments of
interest are established with a low-stringency FASTA search and the score for each
alignment is calculated with Eq. 6:

Score = \sum_{i=1}^{n} \ln(E_i / 0.25)    (6)

where n is the length of the alignment (generally 19); E_i = P_i from Fig. 7 if position i in the
alignment is a match and E_i = (1 - P_i)/3 if position i is a mismatch. It was observed that the
number of alignments for a given siRNA which score above a threshold is predictive of the
number of observed off-target effects. The threshold of the score was optimized to
maximize the correlation between predicted and observed numbers of effects (Fig. 8). The
selection pipeline uses the optimized threshold to favor sequences with relatively small
numbers of predicted off-target effects.
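
The scoring rule of Eq. 6 can be illustrated with a short Python sketch. It reads Eq. 6 as a sum of natural-log odds, ln(E_i/0.25), which is an assumption where the printed equation is unclear, and it assumes ungapped alignments of equal length. The function names and the match probabilities used in the usage note are hypothetical; the actual P_i values are those of Fig. 7 and are not reproduced here.

```python
import math


def offtarget_alignment_score(sirna, target, match_prob):
    """Sum of ln(E_i / 0.25) over an ungapped alignment.

    sirna, target: aligned sequences of equal length.
    match_prob: P_i for each position (hypothetical values, each strictly
    between 0 and 1 so that every logarithm is defined).
    """
    if not (len(sirna) == len(target) == len(match_prob)):
        raise ValueError("sequences and match_prob must have the same length")
    score = 0.0
    for base_a, base_b, p in zip(sirna, target, match_prob):
        # E_i = P_i on a match, (1 - P_i) / 3 on a mismatch
        e = p if base_a == base_b else (1.0 - p) / 3.0
        score += math.log(e / 0.25)
    return score


def count_predicted_offtargets(sirna, aligned_targets, match_prob, threshold):
    """Number of alignments whose score exceeds the chosen threshold."""
    return sum(
        1
        for target in aligned_targets
        if offtarget_alignment_score(sirna, target, match_prob) > threshold
    )
```

With a flat P_i of 0.25 every position contributes ln(1) = 0, whether it matches or not, so the score only departs from zero where the matrix says a position matters more or less than chance.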
6.3. EXAMPLE 3: CURVE MODEL PSSMS



PSSMs were also generated by a method which hypothesized dependency of the base
composition of any one position on its neighboring positions, referred to as "curve models".

The curve models were generated as a sum of normal curves. Each curve represents
the probability of finding a particular base in a particular region. The value at each position
in the summed normal curves is the weight given to that position for the base represented by
the curve. The weights for each base present at each position in each siRNA and its flanking
sequences were summed to generate an siRNA's score, i.e., the score is Σ w_i. The score
calculation can also be described as the dot product of the base content in the sequence with
the weights in the curve model. As such, it is one way of representing the correlation of the
sequence of interest with the model.
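
As an illustration of the curve-model score, the following Python sketch sums normal curves to obtain a weight per position and then adds the weights at the positions occupied by G or C. It assumes that the amplitude is the height of each curve at its mean, that position 1 is the first base of the 19mer, and that only the 19mer itself (not the flanking sequence) is scored; the function names are hypothetical.

```python
import math


def curve_weight(peaks, position):
    """Weight at one position: sum of normal curves.

    peaks: iterable of (mean, standard_deviation, amplitude) tuples, where
    amplitude is taken to be the curve height at its mean (an assumption).
    """
    return sum(
        amplitude * math.exp(-((position - mean) ** 2) / (2.0 * sd ** 2))
        for mean, sd, amplitude in peaks
    )


def gc_curve_score(seq19, peaks):
    """Dot product of G/C content with the curve weights over the 19mer."""
    return sum(
        curve_weight(peaks, pos)
        for pos, base in enumerate(seq19.upper(), start=1)
        if base in "GC"
    )


# Three peaks given as (mean, standard deviation, amplitude) for a G/C model.
INITIAL_GC_PEAKS = [
    (1.5, 2.0, 0.0455),
    (11.0, 0.5, 0.0337),
    (18.5, 4.0, -0.0548),
]
```

Under these peak values a G or C at position 1 raises the score and a G or C at position 19 lowers it, which matches the base preferences described for good siRNAs.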

Curve models can be initialized to correspond to the major peaks and valleys present
in the smoothed base composition difference between good and bad siRNAs, e.g., as
described in FIGS. 1A-C and 5A-C. The initial model can be set up for the 3-peak G/C curve
model as follows:

Peak 1

mean: 1.5

standard deviation: 2

amplitude: 0.0455

Peak 1 mean, standard deviation and amplitude are set to correspond to the peak in the mean
difference in GC content between good and bad siRNAs occurring within bases -2 - 5 of the
siRNA target site in Set 1 training and test sets.

Peak 2

mean: 11

standard deviation: 0.5

amplitude: 0.0337

Peak 2 mean, standard deviation and amplitude are set to correspond to the peak in the mean
difference in GC content between good and bad siRNAs occurring within bases 10-12 of the
siRNA target site in Set 1 training and test sets.

Peak 3

mean: 18.5

standard deviation: 4

amplitude: -0.0548

Peak 3 mean, standard deviation and amplitude are set to correspond to the peak in the
mean difference in GC content between good and bad siRNAs occurring within bases 12-25
of the siRNA target site in Set 1 training and test sets.

Peak height (amplitude), center position in the sequence (mean) and width (standard
deviation) of a peak in a curve model can be adjusted. Curve models were optimized by
adjusting the amplitude, mean and standard deviation of each peak over a preset grid of
values. Curve models were optimized on several training sets and tested on several test sets,
e.g., training sets and test sets as described in Table II. Each base - G/C, A or U - was
optimized separately, and then combinations of optimized models were screened for best
performance.

The optimization criteria for curve models were: (1) the fraction of good oligos in the
top 10%, 15%, 20% and 33% of the scores, (2) the false detection rate at 33% and 50% of the
siRNAs selected, and (3) the correlation coefficient of siRNA silencing vs. siRNA scores
used as a tiebreaker.

When the model is trained, a grid of possible values for amplitude, mean and standard
deviation of each peak is explored. The models with the top value or within the top range of
values for any of the above criteria were selected and examined further.
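
A grid exploration of this kind can be sketched in Python as follows. The sketch is deliberately reduced to a single G/C peak and a single criterion (the fraction of good oligos among the top-scoring third); the function names, the rounding used to size the top group, and the toy data in the usage note are all assumptions made for illustration, not the procedure actually used.

```python
import itertools
import math


def peak_weight(amplitude, mean, sd, position):
    """Height of one normal curve at a position (amplitude = height at mean)."""
    return amplitude * math.exp(-((position - mean) ** 2) / (2.0 * sd ** 2))


def gc_score(seq19, peak):
    """Sum of the peak's weights at positions holding G or C (1-based)."""
    amplitude, mean, sd = peak
    return sum(
        peak_weight(amplitude, mean, sd, pos)
        for pos, base in enumerate(seq19, start=1)
        if base in "GC"
    )


def fraction_good_in_top(scored, top_fraction):
    """scored: list of (score, is_good). Fraction of good oligos in the top group."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    n_top = max(1, int(round(top_fraction * len(ranked))))
    return sum(1 for _, is_good in ranked[:n_top] if is_good) / n_top


def grid_search_single_peak(training, amplitudes, means, sds, top_fraction=0.33):
    """Try every (amplitude, mean, sd) combination; return (best value, best peak)."""
    best = None
    for peak in itertools.product(amplitudes, means, sds):
        scored = [(gc_score(seq, peak), is_good) for seq, is_good in training]
        value = fraction_good_in_top(scored, top_fraction)
        if best is None or value > best[0]:
            best = (value, peak)
    return best
```

For example, with toy training data in which the good oligos are G/C rich at the 5' end, `grid_search_single_peak(data, [0.05], [1.5, 18.5], [2.0])` prefers the peak centered at 1.5. A fuller version would loop over several peaks at once and apply the remaining criteria, with the correlation coefficient as tiebreaker.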

G/C models were optimized with 3 or 4 peaks. A models were optimized with 3
peaks. U models were optimized with 5 peaks.

Exemplary optimization ranges for the models are listed below: 
3 Peak G/C models:
peak 1:
amplitudes: gc1 = 0 - 0.091
means: gc1 = -2.5 - 1.5
standard deviations: gc1 = 2.5 - 4
peak 2:
amplitudes: gc2 = 0.0337 - 0.1011
means: gc2 = 11 - 11.5
standard deviations: gc2 = 0.5 - 0.9
peak 3:
amplitudes: gc3 = -0.1644 - -0.0822
means: gc3 = 18.75 - 20.75
standard deviations: gc3 = 2.5 - 3.5

4 Peak G/C models:
peak 0:
amplitudes: gc0 = 0 - 0.091
means: gc0 = -5.5 - -3.5
standard deviations: gc0 = 1 - 2.5
peak 1:
amplitudes: gc1 = 0 - 0.091
means: gc1 = -2.5 - 1.5
standard deviations: gc1 = 2.5 - 4
peak 2:
amplitudes: gc2 = 0.0337 - 0.1011
means: gc2 = 11 - 11.5
standard deviations: gc2 = 0.5 - 0.9
peak 3:
amplitudes: gc3 = -0.1644 - -0.0822
means: gc3 = 18.75 - 20.75
standard deviations: gc3 = 2.5 - 3.5

5 Peak U models:
U peak 1:
amplitudes: u1 = -0.2 - 0.0
means: u1 = 1 - 2
standard deviations: u1 = 0.75 - 1.5
U peak 2:
amplitudes: u2 = 0.0 - 0.16
means: u2 = 5 - 6
standard deviations: u2 = 0.75 - 1.5
U peak 3:
amplitudes: u3 = 0.0 - 0.1
means: u3 = 10 - 11
standard deviations: u3 = 1 - 2
U peak 4:
amplitudes: u4 = 0.0 - 0.16
means: u4 = 13 - 14
standard deviations: u4 = 0.75 - 1.5
U peak 5:
amplitudes: u5 = 0.0 - 0.16
means: u5 = 17 - 18
standard deviations: u5 = 1 - 3

3 Peak A model:
A peak 1:
amplitudes: a1 = 0.0442 - 0.2210
means: a1 = 5.5 - 6.5
standard deviations: a1 = 1 - 2
A peak 2:
amplitudes: a2 = -0.05 - 0
means: a2 = 10 - 12.5
standard deviations: a2 = 2.5 - 4.5
A peak 3:
amplitudes: a3 = 0.0442 - 0.2210
means: a3 = 18 - 20
standard deviations: a3 = 4 - 6

An exemplary set of curve models for PSSM is shown in FIG. 11A. FIG. 11B shows
the performance of the models on training and test sets.

6.4. EXAMPLE 4: BASE COMPOSITION MODELS FOR PREDICTION OF STRAND
PREFERENCE OF siRNAS

The mean difference in G/C content between good and bad siRNAs provides a model
for G/C PSSMs which can be used to classify siRNA functional and resistant motifs. As it is
known that both strands of the siRNA can be active (see, e.g., Elbashir et al., 2001, Genes
Dev. 15:188-200), it was of interest to discover how well the G/C contents of both sense and
antisense strands of siRNAs fit the model of siRNA functional target motif G/C content
derived from the mean difference in G/C content between good and bad siRNAs. To this
end, the reverse complements of good and bad siRNAs were examined. These reverse
complements correspond to the hypothetical perfect match target sites for the sense strands of
the siRNA duplexes. The reverse complements were compared to the actual good and bad
siRNAs, represented by the actual perfect match target sites of the antisense strands of the
siRNA duplexes.

FIG. 14A shows the difference between the mean G/C content of the reverse
complements of bad siRNAs and the mean G/C content of the bad siRNAs themselves,
within the 19mer siRNA duplex region. The difference between the mean G/C content of
good and bad siRNAs is shown for comparison. The curves were smoothed over a window
of 5 (or portion of a window of 5, at the edges of the sequence).

FIG. 14B shows the difference between the mean G/C content of the reverse
complements of good siRNAs and the mean G/C content of bad siRNAs, within the 19mer
siRNA duplex region. The difference between the mean G/C content of good and bad
siRNAs is shown for comparison. The curves were smoothed over a window of 5 (or portion
of a window of 5, at the edges of the sequence).
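
The smoothing applied to these curves can be pictured with a small Python sketch. It assumes a centered moving average in which the window is simply truncated at either end of the sequence, which is one reading of "portion of a window of 5"; the function name is hypothetical.

```python
def smooth(values, window=5):
    """Centered moving average; edge positions average the available portion."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

Applied to a per-position G/C difference curve over the 19mer, this returns a curve of the same length, with the first two and last two positions averaged over 3 or 4 values instead of 5.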

The reverse complements of bad siRNAs were seen to be even more different from
the bad siRNAs themselves than are good siRNAs. On average, the reverse complements
of bad siRNAs had even stronger G/C content at the 5' end than the good siRNAs did and
were similar in G/C content to good siRNAs at the 3' end. In contrast, the reverse
complements of good siRNAs were seen to be substantially more similar to bad siRNAs than
the good siRNAs were. On average, the reverse complements of good siRNAs hardly
differed from bad siRNAs in G/C content at the 5' end and were only slightly less G/C rich
than bad siRNAs at the 3' end.

These results appear to imply that the G/C PSSMs are distinguishing siRNAs with
strong sense strands as bad siRNAs from siRNAs with weak sense strands as good siRNAs.
An siRNA whose G/C PSSM score is greater than the G/C PSSM score of its reverse
complement is predicted to have an antisense strand that is more active than its sense strand.
In contrast, an siRNA whose G/C PSSM score is less than the G/C PSSM score of its reverse
complement is predicted to have a sense strand that is more active than its antisense strand.
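
The strand comparison described above can be sketched in Python as follows. The weights are a hypothetical linear ramp (positive at the 5' end, negative at the 3' end) standing in for the smoothed G/C difference curve, which is not reproduced here; the function names are also assumptions, and sequences are written in the DNA alphabet used in the tables.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA-alphabet sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_pssm_score(seq, weights):
    """Dot product of G/C content (1 for G or C, else 0) with the weights."""
    return sum(w for base, w in zip(seq.upper(), weights) if base in "GC")


def predicted_active_strand(seq, weights):
    """Compare the score of the siRNA with that of its reverse complement."""
    own = gc_pssm_score(seq, weights)
    rev = gc_pssm_score(reverse_complement(seq), weights)
    if own > rev:
        return "antisense"
    if own < rev:
        return "sense"
    return "undetermined"


# Hypothetical weights: +1 at position 1 falling linearly to -1 at position 19.
RAMP_WEIGHTS = [1.0 - i / 9.0 for i in range(19)]
```

With these illustrative weights, a 19mer that is G/C rich at its 5' end and A/T rich at its 3' end, such as GCGTTAATGTGATAATATA, is called antisense-active, and its reverse complement is called sense-active.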

It has been shown that increased efficacy corresponds to greater antisense strand
activity and lesser sense strand activity. Thus the G/C PSSMs of this invention would appear
to distinguish good siRNAs with greater efficacy due to dominant antisense strand activity
("antisense-active" siRNAs) from siRNAs with dominant sense strand activity
("sense-active" siRNAs).

The relevance of comparison of G/C PSSMs of siRNAs and their reverse
complements for prediction of strand bias was tested by comparison with estimation of strand
bias from siRNA expression profiles by the 3'-biased method.

siRNAs and their reverse complements were scored using the smoothed G/C content
difference between good and bad siRNAs within the 19mer, shown in FIG. 14A, as the
weight matrix. The G/C PSSM score of each strand was the dot product of the siRNA strand
G/C content with the G/C content difference matrix, following the score calculation method
of curve model PSSMs.

siRNAs were called sense-active by the 3'-biased method of expression profile
analysis if the antisense-identical score exceeded the sense-identical score. siRNAs were
called sense-active by the G/C PSSM method if their reverse complement G/C PSSM score
exceeded their own G/C PSSM score.

In FIG. 15, siRNAs were binned by measured silencing efficacy, and the frequency of
sense-active calls by the expression profile and G/C PSSM methods was compared.
Although these techniques are based on distinct analyses, the agreement is quite good. Both
show that a higher proportion of low-silencing siRNAs than high-silencing siRNAs are
predicted to be sense-active. The correlation coefficient for (siRNA G/C PSSM score -
reverse complement G/C PSSM score) vs. log10(sense-identity score/antisense-identity score)
is 0.59 for the set of 61 siRNAs binned in FIG. 15.

6.5. EXAMPLE 5: DESIGNING SIRNAS FOR SILENCING GENES HAVING LOW
TRANSCRIPT LEVELS

In the previous examples, an improved siRNA design algorithm that permits selection
of siRNAs with greater and more uniform silencing ability was described. Despite this
dramatic improvement, some genes remain difficult to silence with high efficacy. A general
trend toward poorer silencing for poorly-expressed genes (less than -0.5 intensity on
microarray; <5 copies per cell; Figure 16) was observed. This example describes
identification of parameters affecting silencing efficacy of siRNAs to poorly expressed genes.

Twenty-four poorly-expressed genes were selected for detailed analysis of parameters
affecting siRNA silencing efficacy. A number of criteria were evaluated for their ability to
distinguish good and bad siRNAs, including base composition of the 19mer siRNA duplex
sequence and the flanking target region. In addition, the contribution of the GC content of
the target transcript was considered. These tests revealed that siRNA efficacy correlated well
with siRNA and target gene base composition. In particular, the GC content of good siRNAs
differed substantially from that of bad siRNAs in a region-specific manner (Figure 17). The
sequences of siRNAs used in generating Figure 17 are listed in Table IV. Good siRNA
duplexes tended to be GC poor at positions 2-7 of the 5' end of the sense strand, and GC poor
at the 3' end (positions 18-19). Furthermore, siRNA efficacy correlated with low GC content
in the transcript sequence flanking the siRNA binding site. The requirement for low GC
content as a determinant of siRNA efficacy may explain the difficulty in silencing the
poorly-expressed transcripts, as these transcripts tend to be GC rich overall. Base composition of the
siRNA duplex also affected silencing of poorly expressed genes. In particular, the GC
content of good siRNAs differed substantially from that of bad siRNAs in a region-specific
manner (Figure 17). Good siRNA duplexes tended to be GC rich at the first position, GC
poor at positions 2-7 of the 5' end of the sense strand, and GC poor at the 3' end (positions
18-19). Of the criteria examined, low GC content in positions 2-7 of the sense strand (Figure
17, dotted line) produced the greatest improvement in silencing efficacy. This is consistent
with the region of the siRNA implicated in the catalysis step of transcript silencing. Low GC
content in this region may provide accessibility or optimal helical geometry for enhanced
cleavage. Requiring low GC content in this region of the siRNA may also select for target
sites that contain low GC content flanking the binding site, which also correlated with
silencing efficacy.

The base composition for good siRNAs to poorly-expressed genes diverges somewhat
from our previously-derived base composition criteria for good siRNAs to well-expressed
genes (Figure 17, solid line). Good siRNAs to both types of genes show a preference for
high GC at position 1, and low GC at the 3' end. However, siRNAs for well-expressed genes
show an extreme asymmetry in GC content between the two termini, while siRNAs for
poorly-expressed genes prefer a more moderate asymmetry. Our previous design algorithm
seeks to maximize asymmetry, in accordance with the features seen in good siRNAs to
well-expressed genes. Our current results indicate that base composition of more than one region
of the siRNA can influence efficacy. Different regions of the siRNA may be more critical for
silencing of different targets, perhaps depending on target transcript features such as
expression level or overall GC content. Consistent with this idea, different commercially
available design algorithms work well on different subsets of genes (data not shown).

A new siRNA design algorithm was developed based on the GC composition derived
for poorly-expressed genes. The new algorithm includes the following adjustments to the
previous algorithm:

(1) selection for 1-3 G+C in sense 19mer bases 2-7,

(2) sense 19mer base 1 & 19 asymmetry (position 1, G or C; position 19, A or T),

(4) greatest off-target BLAST match no more than 16, and

(5) 200 bases on either side of the 19mer are not repeat or low-complexity sequences.

The new algorithm was compared to the algorithm described in previous examples by
side-by-side testing of new siRNAs selected by each. The results obtained with three siRNAs
selected by each method are shown in Figure 18. siRNAs designed by the new algorithm of
the present example showed better median efficacy (80%, compared to 60% for the standard
method siRNAs) and were more uniform in their performance. The distribution of silencing
efficacies of siRNAs obtained by the new algorithm was significantly better than that of the
previous algorithm for the same genes (p = 10^[illegible], Wilcoxon rank-sum). siRNAs designed
using the new design algorithm also appear effective at silencing more highly-expressed
transcripts, based on an examination of 12 highly-expressed genes.
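
The two base-composition adjustments that depend only on the sense 19mer, items (1) and (2), can be expressed as a simple filter. The Python sketch below is illustrative only: the function name is hypothetical, it accepts U as equivalent to T, and it does not attempt the BLAST or repeat-masking criteria, which require external tools and sequence databases.

```python
def passes_base_composition_criteria(sense19):
    """Check criteria (1) and (2) on a sense-strand 19mer.

    (1) 1-3 G or C bases among positions 2-7;
    (2) position 1 is G or C and position 19 is A or T.
    """
    seq = sense19.upper().replace("U", "T")
    if len(seq) != 19:
        raise ValueError("expected a 19mer")
    gc_in_2_to_7 = sum(1 for base in seq[1:7] if base in "GC")
    return 1 <= gc_in_2_to_7 <= 3 and seq[0] in "GC" and seq[18] in "AT"
```

For instance, CATCTTTGCTGAATCGAAA passes (C at position 1, one G/C in positions 2-7, A at position 19), whereas GGGACCCTCATGGCGTCGC fails because positions 2-7 hold five G/C bases and position 19 is C.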

The new design criteria may capture features important to siRNA functionality in
general (Figure 19), and emphasize that different regions of siRNAs have different functions
in transcript recognition, cleavage, and product release. Bases near the 5' end of the guide
strand are implicated in transcript binding (both on- and off-target transcripts), and have
recently been shown to be sufficient for target RNA-binding energy. The design criteria are
also consistent with available data on how siRNAs interact with RISC, the protein-RNA
complex that mediates RNA silencing. These studies show that weaker base pairing at the 5'
end of the antisense strand (3' end of the duplex) encourages preferential interaction of the
antisense strand with RISC, perhaps by facilitating unwinding of the siRNA duplex by a 5'-3'
helicase component of RISC. As in the previous design, our new design maintains the base
composition asymmetry that encourages preferential interaction of the antisense strand. This
suggests that the previous inefficiency of silencing poorly-expressed transcripts is not due to
inefficient association with RISC, but rather is likely due to inefficient targeting of the RISC
complex to the target transcript, or inefficient cleavage and release of the target transcript.
The designs described in these examples include a preference for U at position 10 of the
sense strand, which has been associated with improved cleavage efficiency by RISC as it is in
most endonucleases. The observed preference for low GC content flanking the cleavage site
may enhance accessibility of the RISC/nuclease complex for cleavage, or release of the
cleaved transcript, consistent with recent studies demonstrating that base pairs formed by the
central and 3' regions of the siRNA guide strand provide a helical geometry required for
catalysis. The new design criteria may increase the efficiency of these and additional steps in
the RNAi pathway, thereby providing efficient silencing of transcripts at different levels of
expression.



Table IV siRNAs for Figure 17 



ACCESSION NUMBER 


GENE 


sIRNA sequence 


SKOroNO 


AKO&TSflA NM 030932 


D1APH3 


GCAGTGATTGCTCAGCAGC 


409 


AKQ^lOehl NM 030932 


D1APH3 


GAGTTTACCGACCACCAAG 


410 


AKOIVTfflA NM 03QQ32 


DfAFHS 


CACGGTTGGCAGAGTCTAT 


411 


Airn09f¥>d NM 03003? 


DIAPH3 


TGCGGATGCCATTCAGTGG 


412 


MM 01/187^ 


KIF14 


AAACTGGGAGGCTACTTAC 


413 


MM (iidsn^ 


KIF14 


CTCACATTGTCCACCAGGA 


414 




KIF14 


GACCATAGCATCCGCCATO 


415 


NM ni4R7S 

1NIVI_V1 "to I J 


KIF14 


AGAGCCTTCOAAGGCTTCA 


416 


NM nt4R7S 


KIF14 


TAGACCACCCATTGCTrcC 


417 


MM ntiL97S 


KIF14 


ACTGACAACAAAGTGCAGC 


418 


U J jjJU 


nMrm 

L/INVxIl 


TGGCCAGCGCTTACTGGAA 


419 


U □ 3 J jU 


DNCHl 


GCAAGTTGAGCTCTACCGC 


420 


IN M^wUO 


HMnm 


TIXTIXjIXjIGGGACCGTAAT 


421 




HMnr*R 


CAACAGAAGGTTGTCTIXJI' 


422 


MM Annaco 


UMHPR 


cagagacagaatctacact 


423 


MM nnn&^Q 


HMnm 


cacgatgcatagccatcct 


424 


MM IVkflT7l 




G AGGTAC AATTGCG AAT AT 


425 


MM finnT7i 


NPCl 


GCCACAGTCXrrcrrGCTGT 


426 


\i\A nnfW7i 


MPT*! 


TACTACGTCXjGACAGAGTT 


427 


rMjvi^UUUfc/ 1 


NPCl 


AACTACAATAACGCCACro 


428 


MM nnAWi 


KNSLl 


TACTGATAATGGTACTGAA 


429 




IVlNdt.«J 


TACATGAACTACAAGAAAA 


430 




IfMQl t 


rs An* A AGCTT AATTGCTTT 


431 


NMJUWjZJ 


IfMQT 1 


AfTTTGArC^AACACAATGCA 


432 


NMJ004523 


ItTMCI 1 
NjNoLd 




433 


MKif /WACO'S 


KINdldl 


AnOAfTTRATAATTAAAGGT 


434 




niNOlrfl 


A AACTCTG AGT AC ATTGGA 


435 




yvjcT 1 


TArTAAACAGATTGATGTT 


436 




VM^I 1 


nrTCAAGGAAAACATACAC 


437 






CUGGATCGTAAGAAGGCAG 


438 






GACTTCATTGACAGTGGCC 


439 




KNSLl 


GGACAACTGCAOCTACTCT 


440 


MM nfVt^73 


KNSLl 


GGGGCAGTATACTGAAGAA 


441 




KNSLl 


GACCrGTCCCTTTTAGAGA 


442 




KNSLl 


AAAGGACAACTGCAGCTAC 


443 




IklNOInfl 


TACAAAGAATAAATTTTCT 


444 


MM rNM5l?3 


KNSLl 


TGGAAGGTGAAAGGTCACC 


445 


NM 00d523 


KNSLl 


TAACTGTTCAAOAAGAGCA 


446 


NM Q04S23 


KNSLl 


TCTATAA' rri'A' lA r I V 1 "IT 


447 


NM 004523 


KNSLl 


GGGACCCTCATGGCGTCGC 


448 


NM 004S23 


KNSLl 


CCAGGGAOACTCCGGCCCC 


449 


NM_004523 


KNSLl 


AnTAAITTGGCAGAGCGG 


450 


NM_004523 


KNSLl 


TGGAAATATAAATCAATCC 


451 


NM_004523 


KNSL1


ACTAACTAGAAtXXTCCAO 


452 


NM_004523 


KNSL1


AAGAAGAATATATCACATC 


453 


NM_004523


KNSL1


TTCTTGTATATTATTAAGT 


454 


NM_004064


CDKN1B


GACGTCAAACGTAAACAGC 


455 


NM_004064


CDKN1B


TGGTGATCACrCCAGGTAG 


456 


NM_004064 


CDKN1B


TGTCCCnTCAGAGACAGC 


457 




CNK 


GTTACCAAGAGCCTCnTG 


458 


NM_004073


CNK 


ATCGTACrrGCTrGTACITA 


459 


NM_004073


CNK 


GAAGACCATCTGTGGCACC 


460 


NM_004073 


CNK 


GGAGACGTACCGCTGCATC 


461 


NM_004073


CNK 


TCAGGGACCAGCnTACTG 


462 


NM_004073 


CNK 


AGTCATCCCGCAGAGCCGC 


463 


NM_001315 


MAPK14 


GGCCmrCACGGGAACTC 


464 


NM_001315


MAPK14 


GAAGCTCTCCAGACCATIT 


465 


NM_001315


MAPK14 


TGCCTACnTGCTCAOTAC 


466 


NM_001315 


MAPK14 


ATGTGATTGGTCTGTTGGA 


467 
NM_001315


MAPK14 


GTCATCAGCTTTCjTCXXZAC 


468 


NM_001315


MAPK14 


CCTACAlGAGAACTGCGGTT 


469 


NM_001315


MAPK14


CCACjTGGCCGATCCTTATG 


470 


NM_001315


MAPK14 


GTGCCTCTrGTTGCAGAGA 


471 


NM_001315


MAPK14


TTCTCCCjAGGTCTAAAGTA 


472 


NM_001315 


MAPK14 


TAATTCACAGGGACCTAAA 


473 


NM_001315


MAPK14


GTGGCCGATCCTTATGATC 


474 


NM_001315


MAPK14 


CTATATACATrCAGCTGAC 


475 


NM_001315


MAPK14 


AATATCCTCAGGGCiTGGAG 


476 


NM_001315 


MAPK14 


GGAACACXCCCCCXTTTATC 


477 


NM_006101 


HEC 


CTGAAGGCTTCCTTACAAG 


478 


NM_006101 


HEC 


AGAACCGAATCOTCTAGAG 


479 


NM_006101


HEC 


PAGAAt Tl'ltj IGGAATGAGG 


480 


NM_006101


HEC 


fiTTCA A. AAGCTGG ATG ATC 


481 


NM_006101


HEC 




482 


NM_006101


HEC 




483 


NM_000314


PTEN 


rrTiArf^APAGrrTAGAACTT 


484 


NM_000314 


PTEN 


fAITTAr^S Ann AfiTTYTrr A A A 


485 


NM_000314


PTEN


rTATTCrf^AfTnTAnAGGCG 


486 


NM_000314 


PTEN 




487 


NM_000314 


PTEN 


A AfifiP AHPT A A ACJG AAGTG 


488 


NM_000314


PTEN 


TfjnAninr^r; A ATfinv AG AA 


489 


NM_000075


CDK4 


nrn a Anrv'TY'TTir'r^TTTPff a 


490 


NM_000075


CDK4 


PAfrrr A ArirrrinrTfiAfTT 

v^fVVJ I V>/-Vx\VJV^ 1 VJvJV-. 1 \Jt\Kr 1 1 


491 


NM_000075


CDK4 


rjn A Trr*T"r; A TtVY^fY*' A ITITI' 

VJUrv I L. A vJrV i \JvAJV^V>/\\J 111 


492 


NM_000075 


CDK4 




493 


NM_006622


SNK 


TrrrrArTSAnATY^APAnAiT 


494 


NM_006622 


SNK 




495


NM_006622


SNK 




496 


NM_139164


STARD4 


A fY* AAA fym* n TTi Ar* Aflfl 


497 


NM_139164 


STARD4 


PTf yiTPfSfwAnAAAAf OTTC 


498 


NM_139164


STARD4 


nArAAr^pr'AAAf^AnAGrrr' "' 


499 


NM_139164


STARD4 


f7rf**TTOArTrinr»ATfiAAAA 


500 


NM_005030


PLK 


ClGflACl A AOATfTTPPATGG A 


501 


NM_005030


PLK




502 


NM_005030 


PLK 




503 


NM_005030


PLK 




504 


NM_005030


PLK 




505 


NM_005030


PLK 




506 


NM_005030


PLK 


V9\)\JV>VXVJV' III \jX^y^r\f\\J I VJV^ 


507 


NM_005030


PLK 


Aril A^JT'nTSrTTA ATflArTt A 


508 


NM_005030


PLK




509 


NM_005030


PLK 




510 


NM_005030


PLK 


rnfJAmTGCAGCTCCCGGA 


511 


NM_005030 


PLK 


A An Art Art^ APfTTCIGG AT 


512 


NM_005030


PLK 


AfTTfiGijTGGACTATTCGGA 


513 


NM_005030


PLK


TYTT ATT^ ATYTTAT APnTfiT 


514 


NM_005030


PLK 


A/VVJ A/V.V>/v\V»^/\Vl 1 WIVI I 1 


515 


NM_005030


PLK 


\J\jH^/\i\^Ji\ 1 1 VJ 1 VJV^Vv I /WJ I 


516 


NM_005030


PLK 




517 


NM_005030


PLK 


PTP A AfsGTCTCfT AAT AfiC 


518 


NM_005030


PLK 


/^A/vYw^Ara^TTY^rinrjAfiPA 


519 


NM_005030


PLK 




520 


NM_005030


PLK 


i^VA-iV» I V^V-V-v.V< 1 U./\/\i^v_4^\_rV 


521 


NM_005030


PLK 




522 


NM_005030


PLK 


TTTTrr^rTAAAAGAGAPr 


523 


NM_005030


PLK 


TAP ATCi AGCG AGC ACTTGC 


524 


NM_005030


PLK


PA ATG<^rTX:!CAAG(XTrrDG 


525 




IGF1R


Pf n AT A TTfiGGPTrTAPA AP 


526 


NM_000875


IGF1R




527 


NM_000875


IGF1R


ra^TY^ A rf^rfTrmp ATT APTYi Ars 


528 


NM_000875


IGF1R


I^ATrSA'TTr'AOATYTinPfYSnA 


529 


NM_000875 


IGF1R


pn apacggoptg' f i? i 'agct 


530 


NM_000875 


IGF1R


AATGCTGACCrCTGTTACC 


531 


NM_000875


IGF1R


TCTCAAGGATATTGGGCrr 


532 


NM_000875 


IGF1R


CATTACTCGGGGGGCCATC 


533 


NM_000875


IGF1R


TGCTGACCTCTGTTACCTC 


534 


NM_000875


IGF1R


CTACGCCcrcGrrcATcrrc 


535 


NM_000875


IGF1R


CCTCACXXjTCATCCGCGGC 


536 


NM_000875


IGF1R


CCTGAGKJAACATTACTCGG 


537 


NM_001813


CENPE 


GGACAGCmCTAGGACCT 


538 
NM_001813


CENPE


GAAGAGATCCCAGTGCTTC 


539 


NM_001813 


CENPE 


ACrCnTACTGCTCTCCAGT 


540 


NM_001813


CENPE


TCTGAAAGTGACCAGCTCA 


541 


NM_001813


CENPE 


GAAAATGAAGCTTTGCGGG 


542 


NM_001813


CENPE 


CTTAACACGGATGCTGGTG 


543 


NM_004958


FRAP1


CriGCACilXCTTGTTrGTG 


544 


NM_004958


FRAP1


CAACCTCCAGGATACACrC 


545 


NM_004958


FRAP1


GACATGAGAACCTGGCTCA 


546 


NM_004958


FRAP1


CCAACTTTCTAGCTGCTGT 


547 


NM_004958


FRAP1


AGGACTTCGCCCATAAGAG 


548


NM_004958


FRAP1


TAATACAGCTGGGGACGAC 


549 


NM_005163


AKT1


GCTGGAGAACCTCATGCTG 


550 


NM_005163


AKT1


CGOVCCTTCCATGTGGAGA 


551 


NM_005163


AKT1


AGACGTTTTTGTGCTGTGG 


552 


NM_002358 


MAD2L1


TACGGACrCACCITGCTrG 


553 


NM_000551


VHL 


GGCATTGGCATCTGCmT 


554 


NM_000551


VHL 


GTGAATGAGACACrcCAGT 


555 


NM_000551


VHL 


TGTrGACGGACAOCXrrATT 


556 


NM_000551


VHL 


GATCTGK3AAGACCA0CCAA 


557 


NM_000551


VHL 


AGGAAATAGGCAGGGTCrrG 


558 


NM_000551


VHL 


CAQAACCCAAAAGGGTAAG 


559 


NM_001654


ARAF1


GTOCCCACATTCCAAGTCA 


560 


NM_001654


ARAF1


GAATGAGATGCAGGTGCTC 


561 


NM_001654


ARAF1


GTTCCACCAGCATTGTTCC 


562 


NM_001654


ARAF1


CCTCTCrrGGAATTTCTGCC 


563 


NM_001654


ARAF1


AGTGAAGAACCTGGGGTAC 


564 


NM_001654 


ARAF1


TTGAGCTGCTGCAACGGTC 


565 


NM_000435


NOTCH3 


GAACATGGCCAAGGGTGAG 


566 


NM_000435 


NOTCH3


GAGTCTGGGACCrCCTTCT 


567 


NM_000435 


NOTCH3 


AATGGCTrcCGCTGCCrCT 


568 


NM_000435 


NOTCH3


TGATCACTGCTTCCCCGAT 


569 


NM_000435 


NOTCH3


TGCCAACTGAAGAGGATGA 


570 


NM_000435


NOTCH3 


GCTGCTGTTGGACCACnT 


571 


NM_024408


NOTCH2


CCAAGGAACCItiCnTGAT 


572 


NM_024408


NOTCH2


GACrCAGACCACTGCTTCA 


573 


NM_024408


NOTCH2


CnTGAATGCCAGGGGAAC 


574 


NM_024408


NOTCH2


GCAACTTTGGTCTCCnTC 


575


NM_024408


NOTCH2


GAGACAAGTTAACTCGTGC 


576 


NM_024408


NOTCH2


GCAATTGGCTCjTGATGCTC 


577 


NM_012193 


FZD4 


CCATCTGCTTGAGCTACTT 


578 


NM_012193


FZD4 


TrGGCAAAGGCTCCTTGTA 


579 


NM_012193 


FZD4 


AGAACCTCGGCTACAACOr 


580 


NM_012193 


FZD4 


TCGGCTACAACGTGACCAA 


581 


NM_012193 


FZD4 


GTTOACnTACCrGACGGAC 


582 


NM_012193


FZD4 


TCCGCATCrCCATGrroCCA 


583 


NM_007313


ABL1


GAATGGAAGCCTGAACTGA 


584 


NM_007313


ABL1


CAAGTTCrCCATCAAGTXX: 


585 


NM_007313


ABL1


CTAAAC3GTGAAAAGCTC0G 


586 


NM_007313


ABL1


TOCTGCSCAAGAAAGCTTGA 


587 


NM_007313


ABL1


AAACCTCTACACGTTCTGC 


588 


NM_007313


ABL1


AGACATCATGGAGTCCAGC 


589 


NM_017412


FZD3 


CAGATCACrcCAGGCATAG 


590 


NM_017412 


FZD3 


ATGTGTGGTGACTGCTTTG 


591 


NM_017412


FZD3 


AGAGATGGGCATTGTTrCC 


592 


NM_017412


FZD3 


AGCATTGCTGTrrCACGOC 


593 


NM_017412


FZD3 


GCTCATGOAGATOnTGOT 


594 


NM_005633 


SOS1


TGGTGTCCTTXjAGGmxyrC 


595 


NM_005633


SOS1


TATCAGACCGGACCTCTAT 


596 


NM_005633


SOS1


CTTACAAAAGCGAGCACAC 


597 


NM_005633


SOS1


GAACACCGTTAACACCTCC 


598 


NM_005633


SOS1


ATAACAGGAGAGATCCAGC 


599 


NM_005633


SOS1


ATTGACCACCAGGnTCTG 


600 


NM_005417


SRC 


CAATTCGTCGGAGGCATCA 


601 


NM_005417


SRC 


GCAGTGCCTGCCTATGAAA 


602 


NM_005417


SRC 


GGGGAGTTTGCTGGACTTT 


603 


NM_005400


PRKCE


GATCGAGCTGGCrGTCrrT 


604 


NM_005400


PRKCE 


GCrCACX:ATCrGAGGAAGA 


605 


NM_005400 


PRKCE 


GGTCTTAAAGAAGGACGTC 


606 


NM_005400


PRKCE


TCACAAAGTGTGCTGGGTT 


607 


NM_005400


PRKCE 


CCAGGAGGAATTCAAAGGT 


608 


NM_005400


PRKCE 


TGAGOACGACCTATnXjAG 


609 
NM_002388


MCM3 


GTCTCACCITCTCCGGTAT 


610 


NM_002388


MCM3 


C3TACATCCATGTGG0CAAA 


611 


NM_002388 


MCM3 


AGGATTITCTGGCCTOCAT 


612 


NM_002388


MCM3 


TGGCTrCATOAAAGCTCCCA 


613 


NM_002388


MCM3 


TCCAGGTTGAAGGCATrCA 


614 


NM_002388 


MCM3 


GCAGATGAGCAAGGATGCT 


615 


NM_004380


CREBBP 


GAAAAACGGAGGTCGCGTT 


616 


NM_004380


CREBBP


GACATCCCGAGTCTATAAG 


617 


NM_004380


CREBBP 


TGGAGGAGAATTACKXXrrr 


618 


NM_004380


CREBBP 


AirnfUCGGCGCCAGAAT 


619 


NM_004380


CREBBP 


GC:ACAAGGA<iG-lCl'li:rrC 


620


NM_004380 


CREBBP 


GAAAACAAATGCCCCGTGC 


621 


NM_006219


PIK3CB


CAAAGATGCCXrrrCTGAAC 


622 


NM_006219


PIK3CB 


GTGCACATTCCTGCTGTCT 


623 


NM_006219


PIK3CB 


AAGTTCATGTCAGGGCTGG 


624 


NM_006219


PIK3CB 


AATC3CGCAAATTCAGCGAG 


625 


NM_006219


PIK3CB 


AATGAAGCCnTGTGGCTG 


626 


NM_006219 


PIK3CB 


TACAGAAAAGTTTGGCCGG 


627 


NM_006218 


PIK3CA 


CTAGGAAACCTCAGGCTTA 


628 


NM_006218 


PIK3CA 


TrCAGCTAGTACAGGTCCn* 


629 


NM_006218 


PIK3CA 


TGATGCACATCATGGTGGC 


630 


NM_006218 


PIK3CA 


AGAAGCTGTGGATCrTAOa 


631 


NM_006218


PIK3CA 


AGGTGCACTOCAGTTCAAC 


632 


NM_006218


PIK3CA


TGGCnTGAATCTTTGGCC 


633 


NM_002086


GRB2


CTGGTACAAGGCAGAGCrr 


634 


NM_002086


GRB2 


CGGGCAGACCGGCATGTTr 


635 


NM_002086


GRB2 


CCGGAACGTCTAAGAGTCA 


636 


NM_002086


GRB2 


ATACGTCCAGGCCCTCTTT 


637 


NM_002086


GRB2 


TGAGCTGGTGGATrATCAC 


638 


NM_002086


GRB2 


TGCAGCACITCAAGCrrGCT 


639 


NM_001982


ERBB3


TGACACrrGGAGCCTGTGTA 


640 


NM_001982


ERBB3


CrrAGACCTAGACCTAGACT 


641 


NM_001982


ERBB3 


CmrrCTGAATGCXjGAGCCT 


642 


NM_001982


ERBB3 


OAGGATGTCAACGOrTATG 


643 


NM_001982 


ERBB3 


CAAAGTCTTGGCCAGAATC 


644 


NM_001982


ERBB3 


TACACACACCAGAGTGATG 


645 


NM_001903


CTNNA1


CGTrCCGATCCTCTATACT 


646 


NM_001903


CTNNA1


AAGCCATTGCnXjAAGAGAG 


647 


NM_001903


CTNNA1


TGTCrrCATTCKTCTCCAAO 


648 


NM_001903


CTNNA1


ACSCAGTGCrrGATGATAAGG 


649 


NM_001903 


CTNNA1


TGACCAAAGATGACCTGTG 


650 


NM_001903 


CTNNA1


TGACATCATr(jTGCnX3GCC 


651 


NM_003600


STK6 


CACCCAAAAGAGCAAGCAG 


652 


NM_003600


STK6 


GCACAAAAGCTTGTCrCCA 


653 


NM_003600


STK6 


CCrCXXTATTCAGAAAGCT 


654 


NM_003600 


STK6 


ACAGTCTTAGGAATCGTGC 


655 


NM_003600 


STK6 


OACnTGAAATTGGTCGCC 


656 


NM_003600


STK6 


TTGCAGATnTGGGTGOTC 


657 


NM_003161


RPS6KB1


GACACTGCCTGcmTACT 


658 


NM_003161


RPS6KB1


CTcrcAcrrGAAAcrrGCCAA 


659 


NM_003161 


RPS6KB1 


CKriTTTCCCATGATCTCCA 


660 


NM_003161 


RPS6KB1


TTGATTCCrCGCGACATCT 


661 


NM_003161


RPS6KB1 


GAAAGCCAGACAACTTCTG 


662 


NM_003161


RPS6KB1 


CTTGGCATGGAACATTGTG 


663 


AF308602 


NOTCH1


GATCGATCGCTACGAarGr 


664 


AF308602 


NOTCH1


CAcrrACACCTCTorrGTCiC 


665 


AF308602 


NOTCH1


. AGGCAAGCCCFGCAAGAAT 


666 


AF308602 


NOTCH1


CATCCCCTACAAGATOCSAG 


667 


AF308602 


NOTCH1


ATATCC5ACGATTGT0CAGG 


668 


AF308602


NOTCH1


ATrCAACGGGCrCTTGTGC 


669 


NM_016231


NLK


CCACTCAGCTCAGATCATG 


670 


NM_016231


NLK


GCAATGAGGACAGCTTGTG 


671 


NM_016231


NLK


TGTACjCnTTCCACTGGACrr 


672 


NM_016231


NLK


TCTCCTTGTGAACAGCAAC 


673 


NM_016231


NLK


GGAAACAGAGTGCCrcTCT 


674 


NM_016231


NLK


TCTGGrrCTCTTGCAAAAGG 


675 


NM_001253


CDC5L 


AAGAAGACGrrCACKX3ACA 


676 


NM_001253


CDC5L 


AAAAAGCCrrGCCCITGGTr 


677 


NM_001253


CDC5L 


TCATTGGAAGAACAGCGGC 


678 


NM_003391


WNT2 


GTTCjTCTCAAAGGAGCTTTC 


679 


NM_003391


WNT2 


GCCTCAGAAAGGGATTGCT 


680 
NM_003391


WNT2 


AGAAOATGA^TCGTCrGGC 


681 


NM_003391


WNT2 


GCICrOOATCSTGCACACAT 


682 


NM_003391 


WNT2 


AACGGGCQjVTTATCrcTGG 


683 


NM_003391 


WNT2 


ATTTGCCCGCGCATTTGTG 


684 


NM_002387


MCC


AGTTGAGGACKmTCTGCA 


685


NM_002387 


MCC


GACTTAGACCTGGGAATCT 


686 


NM_002387 


MCC 


GGATTATATCXACJCAGCTC 


687 


NM_002387


MCC 


GAGAATGAGAGCCTGACTG 


688 


NM_002387 


MCC 


TAGCTCTGCTAGAGGAGGA 


689 


NM_002387


MCC 


ACAGAACGGOTOAATAGCC 


690 


NM_005978


S100A2


GGAACTTCTCjCACAAGGAG 


691 


NM_005978 


S100A2 


GGGCCCAGGACTCnTGATG 


692 


NM_005978


S100A2


TGAGAACAGTGACCAGCAG 


693 


NM_005978


S100A2 


TGGCACrCAXCACTGnrCAT 


694 


NM_005978 


S100A2


GACOGACCCTGAAGCAGAA 


695 


NM_005978


S100A2


ITOCAGGAGrTATGCTGnT 


696 


NM_033360


KRAS2 


GAAGTTATGGAATrCCrrT 


697 


NM_033360


KRAS2


GGACTCTGAAGATGTAOCT 


698 


NM_033360 


KRAS2 


GGCATACTACTrACAAGTGG 


699 


NM_033360 


KRAS2 


Accraicivi 'iggatattc 


700 


NM_033360 


KRAS2 


TAAATGrGATTTGCCTTCT 


701 


NM_033360 


KRAS2 


GAAAAGACrCCrGGCTOra 


702 


NM_139049


MAPK8 


GGAATAGTATGCGCAGCTT 


703 


NM_139049


MAPK8 


gtgattcaoatggagctag 


704 


NM_139049


MAPK8 


CACCATGTCCrrGAATTCAT 


705 


NM_139049


MAPK8 


CGAGTnTATGATGACGCC 


706 


NM_139049


MAPK8 


CACCCGTACATCAATGTCT 


707 


NM_139049


MAPK8 


TCAAGCACCTTCATTCTGC 


708 


NM_002658 


PLAU 


CAAGTACTTCTCCAACATr 


709 


NM_002658 


PLAU 


GAGCTGGTXrrCTGATTGTr 


710 


NM_002658


PLAU 


CTGCCCAAAGAAATTCGGA 


711


NM_002658


PLAU 


GTGTAAGCAGCTGAGGTCT 


712 


NM_002658


PLAU 


TGGAGGAACATGTGTGTCC


713 


NM_002658


PLAU 


TTACTOCAGGAAOCCAGAC 


714 


NM_016195


MPHOSPH1


AGAGGAACTCTCTGCAAGC


715 


NM_016195


MPHOSPH1


AAGTTTGTGXCCCAGACAC 


716 


NM_016195


MPHOSPH1


ctcaagaagotactgcttg 


717 


NM_016195 


MPHOSPH1


GACATGCGAATOACACTAO 


718 


NM_016195


MPHOSPH1


AATGGCAGTGAAACACCCT


719 


NM_016195


MPHOSPH1


ATGAAGGAGAGTGATCACC 


720 


NM_020168


PAK6 


(XjACATCCAGAAGTTGTCA 


721 


NM_020168


PAK6 


GAOAAAGAA-TCGGGTCGGT 


722 


NM_020168 


PAK6 


TGACGAGCAOATTGCXACT 


723 


NM_000051


ATM 


TAGATrCnTCCA<K3ACACG 


724 


NM_000051


ATM 


AGTrCGATCAOCAGCTGrrr 


725 


NM_000051


ATM


GAAGTTOGATnGCCAGCTGT 


726 


NM_001259


CDK6 


TCTTGGACCXGATTGGACT 


727 


NM_001259


CDK6 


ACCACAGAACATTCTGGTG 


728 


NM_001259


CDK6 


AGAAAACCrOGATrcCCAC 


729 


NM_004856


KNSL5 


QAATCrrGAGCCTAGAGTGG 


730 


NM_004856


KNSL5


CCATR3GTTACTGACGTGG 


731 


NM_004856 


KNSL5 


AACCCAAACCTCCACAATC 


732 


NM_006845


KNSL6 


ACAAAAACOGAGATCCGTC 


733 


NM_006845 


KNSL6 


GAATTrCGGGCTACnTGG 


734 


NM_006845


KNSL6 


ATAAGCAGCAAGAAACGGC 


735 


NM_004972


JAK2 


AGCCGAOTTGTAACTATCC 


736 


NM_004972


JAK2 


AAGAACCTGOTGAAAGTCC 


737 


NM_004972 


JAK2 


GAAGTX3CAG<:AGGTTAAGA 


738 


NM_005026


PIK3CD 


GATCGGCCACTTCCrnTC 


739 


NM_005026


PIK3CD 


AGAOATCTGGGCCnrCATGT 


740 


NM_005026 


PIK3CD 


AACCAAAGTGAACTGGCTG 


741 


NM_014885


APC10


CAAGGCATCCGTTATATCT 


742 


NM_014885


APC10


ACCAGGATTTGGAGTGGAT 


743 


NM_014885


APC10


GTGGCTGGATTCATCnTCC 


744 


NM_005733


RAB6KIFL 


GAAGCTGTCCCTGCTAAAT 


745 


NM_005733


RAB6KIFL 


CrCTACCACTGAAGAGTro 


746 


NM_005733 


RAB6KIFL 


AAGTGGGTCGTAAGAACCA 


747 


NM_007054 


KIF3A 


GGAGAAAGAlTCCCTTTGAG 


748 


NM_007054


KIF3A 


TATTGGGCCAGCAGATTAC 


749 


NM_007054 


KIF3A 


TTATGACGCTAGGCCACAA 


750 


NM_020242


KNSL7 


GCACAACrcCTGCAAATTC 


751 
NM_020242


KNSL7 


GATGGAAGAGCCTCTAAGA 


752 


NM_020242


KNSL7 


ACGAAAAGCrCCTTGAGAG 


753


NM_001184


ATR 


TCACGACrCGCTGAACTGT 


754 


NM_001184


ATR 


GAAACrrGCAGCTATCTTCC 


755 


NM_001184


ATR 


GTTACAATGAGGCTGATGC 


756 


NM_014875


KIF14


AnrrCTAGAAAACGGTAA 


757 


NM_014875 


KIF14


GAGGGGCGAACTTTCGGCA 


758 


NM_014875


KIF14 


CTGGGACCGGGAAGCCGGA 


759 


NM_014875


KIF14


CrTCTACTTCTGTTGGCAG 


760 


NM_014875


KIF14 


ACTTACTATTCAGACTGCA 


761 


NM_014875


KIF14 


GCCCTCACCCACAGTAGCC 


762 


NM_014875


KIF14 


CAGAGGAATGCACACCCAG 


763 


NM_014875


KIF14 


GATTGATTAGATCrCITOA 


764 


NM_014875


KIF14 


GTGAGTATTATCCCAGTTG 


765 


NM_014875


KIF14


ATCTGGGGTGCTGATTGCT 


766 


NM_014875


KIF14 


GTGACAGTGGCAiGTACGCXS 


767 


NM_014875


KIF14 


TCAGACTGAAGTTGTTAGA 


768 


NM_014875


KIF14 


GTrGGCTAGAATrGGGAAA 


769 




KIF14 


GAAGACCATAGCATGCGCC 


770 


MKif ftftlO?^ 




TtSCCTGAAAGAGACTTGfTO 


771 


NM flOl'Hd 


CHEK1


ATCOATTCTGCTCCTCTAG 


772 




CHEK1


CTGAAGAAGCAGTCGCAGT 


773 




CHEK2 


GATCACAGTGGCAATGGAA 


774 




CHEK2 


ATGAATCCACAGCTCTACC 


775 


MM nvtxUA, 


CHEK2 


AAACTCTTGGAAGTGGTGC 


776 




TP53 


GCACCCAGGACTTCCATTT 


777 




TP53 


CCTCTTGGrrCGACCTTAGT 


778 




TP53 


TGAGGCCTTGGAACItZAAG 


779 






AGCGCCTGGGCCTGGATGA 


780 




PRKCB 


AiCOGGGCAGCATCGTCTCC 


781 




PRKCE


CAGCGGCCAGAGAAGGAAA 


782 


MM flTK^n 


PRKCE 


CAGAAGGAAGAGTGTATGT 


783 


NM nngitnn 


PRKCB 


TGCAGTGTAAAGTCTGCAA 


784 






GCGCATCGGCCAAACGGCC 


785 


MM nn^4/ifl 


PRKCE 


ATTGCAGAGACTTCATCTG 


786 


MM nn^^iYk 


PRKCE 


GAAGAGCCGGTACTCACCC 


787 


ktkji AA^nn 


PRKCE 


AGTACTGGCCGACCrGGGC 


788 




PRKCE 


GGATGCAGAAGGTCACrcC 


789 


MM fY)'^^^m 


PRKCE 


CGTGAGCTTGAAGCCCACA 


790 


INM_JUU3**UU 


ri\ivv>c> 


CACAAAGTGTGCTGGGTTA 


791 


IN Wl^UU J*tUU 


PRKCE 


GACGAAGCAATTGTAAAGC 


792 


MM fVK»nn 




CACCCrrCAAACCACGCAT 


793 






nTT^AGPATCTTGAAAGCTT 


794 


INJVl_UUj*tUU 


PRKCB 


CAACCOAGGAGAGGAGCAC 


795 


7MM nrt<k/tnn 

IN 1V1_V>U J*Hni 


PRKCE 


TACATTGCCCTCAATGTGG 


796 


MM nn^^D 


PRKCE 


GAGGAATCGCCAAAGTACT 


797 


MM OfV^dnft 


PRKCE 


GGGATTTGAAACTGGACAA 


798 


MM nnf\9tft 


PIK3CA


1TACACGTTCATGTGCTGG 


799 


MM nnfi7iR 


PIK3CA 


CACAATCCATGAACAGCAT 


800 


MM 


PIK3CA


CAATCAAAOCTGAACAGGC 


801 


NM An^91R 


PIK3CA


CACFrrCAACAGCCACACAC 


802 




PIK3CA 


GTGTTACAAGGCTTATCTA 


803 




PIK3CA 


GATOCTATGGTrCGAGGTT 


804 


MM nrwoiR 




CmCI^AAATAATGACAAGCA 


805 






At •1*1*1 YVYTPITfY^AnTY^f 
/\v> J Jl I Vl^w 1 1 1 VAiin I 1 1 \Jw 


806 


MM IVMOIft 




AGAATATCAGGGCAACyrAC 


807 


MM fHMOlfl 


PIK3CA 


TTnGATXTTTCCACACAATT 


808 




PIK3CA 


AGTAGGCAACOGTGAAGAA 


809 


NM_006218


PIK3CA 


CAGCKjCrTGCTGTCTCCTC 


810 




PIK3CA 


GAGCCCAAGAATGCACAAA 


811 


NM_006218


PIK3CA 


GCCAGAACAAGTAATTGCT 


812 


NM_006218


PIK3CA 


GGATGCCCTACAGGGCTTO 


813 


NM_006218


PIK3CA 


TCAAATTATTCGTATTATO 


814 


NM_006218


PIK3CA


gaattqgagatx:gtcacaa 


815 


NM_006218


PIK3CA 


TGAGCrGGTGCGAAATTCT 


816 


NM_006218


PIK3CA


QATTTACGGCAAGATATGC 


817


NM_006218


PIK3CA 


TGATGAATACTTCCTAGAA 


818 


NM_001982


ERBB3 


GCTGCTGGGACTATGCCCA 


819 


NM_001982


ERBB3 


ATCTGCACAATTGATGTCT 


820 


NM_001982


ERBB3 


CnTGAACTGGACCAAGCr 


821 


NM_001982


ERBB3 


CATCATGCCCACTGCAGGC 


822 
NM_001982




ERBB3 


AACnTCCAGCTGOAACCC 


823 


NM_001982 




ERBB3 


TGAAGGAAATTAGTGCrCG 


824 


NM_001982




ERBB3 


AATTCGCCAGCGGTrCAGG 


825


NM_001982




ERBB3 


ACCAGAGCTTCAAGACTGT 


826 


NM_001982 




ERBB3 


GAGGCTACAGACTCTGCCT 


827 


NM_001982 




ERBB3 


TGGAGCCACAACTAGACCT 


828 


NM_001982




ERBB3 


ACACTGTACAAGCTCTACG 


829 


NM_001982 




ERBB3 


TAATGGTCACrGCTnPGGG 


830


NM_001982




ERBB3 


ACAGCJCACTCCTGGAGATA 


831 


NM_001982




ERBB3 


CmTAGGACAAACACTGGT 


832 


NM_001982




ERBB3 


GATTACTGGCATAGCAGGC 


833 


NM_001982




ERBB3 


ATGAATACATGAACTGGAG 


834 


NM_001982




ERBB3 


CACTTAATCGGOCAOGTGG 


835 


NM_001982




ERBB3 


GGCCTGTOCrcCTGACAAG 


836 


NM_001982




ERBB3 


TCTGCGGAGTCATGAGGGC 


837 


NM_001982




ERBB3 


TAGACCTACACTTGGAAGC 


838 


NM_004283




RAB3D 


GATTrCAGGTCTCCCTGTC 


839 


NM_004283




RAB3D 


GCCACAGTGGTTATCTCCA 


840 


NM_004283 




RAB3D 


GCAATCCCTTCCCTCCTGT 


841 


NM_004283 




RAB3D 


TCTCTGATCCTGAAGTGAA 


842 


NM_004283 




RAB3D 


CATCAATGTGAAGCAGGTC 


843 


NM_004283




RAB3D 


CATGAGCTTGCTGCTTTCC 


844 


NM_004283




RAB3D 


AACGTGTTCirGOCTGCTCA 


845 


NM_004283




RAB3D 


CTGCTTTCCAGGGTGTGTT 


846 


NM_004283




RAB3D 


GCGGCCAGGGCCAAGCCCX: 


847 


NM_004283




RAB3D 


CTTCTAGCTTAGAACCATT 


848 


NM_004283




RAB3D 


CAGGGTGTGTrGAGGGTGG 


849 


NM_004283




RAB3D 


CIXriTIVlX^AGGrCCrGCA 


850 


NM_004283




RAB3D 


CTTGTGCCAAGATGGCATC 


851 


NM_004283




RAB3D 


GCACCATCACCACGOCCTA 


852 


NM_004283 




RAB3D 


CGCGGACGACTCCTTCACT 


853 


NM_004283 




RAB3D 


TCATCCAGGGAAGGCGGCG 


854 


NM_004283 




RAB3D 


GACACTGACGTGCATGAGC 


855 


NM_004283 




RAB3D 


CCCIXXXJAGGCCClOriTA 


856 


NM_004283 




RAB3D 


AGGTCTTCGAGCGCCTGGT 


857 


NM_004283




RAB3D 


CCTCTTTCTCAGGTCCTGC 


858 


NM_003620




PPM1D


TTGCCCGGGAGCACTTGTQ 


859 


NM_003620




PPM1D


CGTGTGCGACGGGCACGGC 


860 


NM_003620




PPM1D


ATTAGGTCTTAAAQTAGTT 


861 


NM_003620




PPM1D


AGCCCTGACTTTAAGGATA 


862 


NM_003620




PPM1D


TGTGGAGCCCGAACCGACG 


863 


NM_003620




PPM1D


GCGACGGGCACGGCGGGCG 


864 


NM_003620




PPM1D


GATTATATGGGTATATA1T 


865 


NM_003620




PPM1D


TTAGAAGGAGCACAGTTAT 


866 


NM_003620




PPM1D


CCGGCCAGCCGGCCATGGC 


867 


NM_003620




PPM1D


GAGCAGATAACACTAGTGC 


868 


NM_003620




PPM1D


AGATOCCATCTCAATGTOC 


869 


NM_003620




PPM1D


GCGGCACAGTTTGCCXXiGG 


870 


NM_003620




PPM1D


CGTAGCAATQOCl'TC'rCAG 


871 


NM_003620




PPM1D


TATATGGGTATATATTCAT 


872 


NM_003620




PPM1D


GCTGCTAATTCCCAACATT 


873 


NM_003620




PPM1D


ACAACTGCCAGTGTGGTCA 


874 


NM_003620 




PPM1D


TTGACCCTCAGAAGCACAA 


875 


NM_003620




PPM1D


GTCTTAAAGTAGTTACTCC 


876 


NM_003620 




PPM1D


ATGCrcCOAGCAGATAACA 


877 


NM_003620




PPM1D


GCGCCTAGTGTGTCTCCCG 


878 


NM_022048




CSNK1G1


TAGCCATCCAGCTGCTTTC 


879 


NM_022048




CSNK1G1


TTCTCATTGCAAGGGACrc 


880 


NM_022048 




CSNK1G1


CACGCATCTTGGCAAAGAG 


881 


NM_022048 




CSNK1G1


TAGCC'lGGAGUACriGriT 


882 


NM_022048




CSNK1G1


ACTCAATTCTACCTGCAGC 


883 


NM_022048




CSNK1G1


CTAAGTGCTGCTXjTTTCTT 


884 


NM_022048




CSNK1G1


GCAAAGCOGGAGAGATGAT 


885 


NM_022048




CSNK1G1


ccrcrrcACAGACCTCTTr 


886 


NM_022048 




CSNK1G1


GAAGGGACrCCrCTTTGGG 


887 


NM_022048




CSNK1G1


GAGAGCTCAGATTAGGTAA 


888 


NM_022048




CSNK1G1


CACGTAGATTCrOGrGCAT 


889 


NM_022048




CSNK1G1


ATGAGTAnTACGGACCCT 


890 


NM_022048




CSNK1G1


GGTGGGACCCAACrrCAGG 


891 


NM_022048




CSNK1G1


AGAGCTGAATGTTGATGAT 


892 


NM_022048




CSNK1G1


GATTCrGOrGCATCTGCAA 


893 
NM_022048


CSNK1G1


AACfTCAGGGTrcOCA^GA 


894 


NM_022048


CSNK1G1


TCTCGAATGGAATACGTGC 


895 


NM_022048


CSNK1G1


CCGAGGAGAGTGGGAAATT 


896


NM_022048


CSNK1G1


GGGAGOCCACTCCAATGCA 


897 


NM_022048


CSNK1G1


GTCAAGCCAGAGAACTTCC 


898 




CKN1




899 


NM 000032 


CKN1


ATGTGAGAAGAGCATCAGG 


900 




CKN1


AGrAGTGTGTTCCATTGGC 


901 




CKN1




902 




CKN1


rAr!rAnTGATGAAGAA.GGA 


903




CKN1


GATAArTATnCTTAAGGGA 


904 




CKN1


TnOACTTCAfYTTDCTCACT 


905




CKN1




906


MM fWirMW 
IN lTi_Uln/UO £ 


CKN1


AGGAACTTTATACrrGGTAG 


907 




CKN1


AAr«rGATGGAfTIt!Ar!CTC 


908 


INiVI^UUUUoZ 


CKN1




909 




CKN1


G A AGGGAfiATAPATlTrTAT 


910 


rim_UUUUaZ 






01 1 


MX* nfVTAOO 




AT A T<7)rY"TYV A GmW^ A P 




iNMJUUUU&c 






71 J 


iNMJUUUUoZ 


#^K"M1 


TY. A A AfTT ATTifiGAT AP A A A 


014 








7l J 






'm*Ar' AGGGTr* AP An A €^ A A 




NMJUUUUoZ 




G AnrW^ATT" AfTTATTYw ACT 


017 


NMJIXMIBZ 






0151 






AflTJAnriAGnPGA AnG AGAP 


010 




PTTRJ 


PT A PfTTP A PP APP APn*^ An 






r irt\J 


TPPPPT A A TTPP A A A G^ A A 


921 


JNMJUU2c>4i 


ri rivJ 


P A A GT A TGT A GT A A A rJP AT 






r IrKJ 


A APPTPGTPArPPTTPTYSP 


09^ 


NM_UU25S43 


PTPRJ 


PAr'APA An/TTY^fyr^rrs^AT 


094. 


NM_.U02o4J 


PTPRJ 


TYVS A ATTPT A GPfY» ATYli^l A A 


09S 


NMju0284J 


PTPRJ 


ATA A AP AG A ATGO A APTGTI 




NMJUUze43 




fnrjnAnAnPTGPTPp'TTrT 


097 


NM_002843


iyi*DO f 


A APTTT A AriTTP.PP An A AP 


095) 


NM_002843


PTPRJ


AP AP APnVSP AP ATPTnTTGP 


090 


NM_002843


CTTDD f 


<^ A rsT" A ap A/V7Pr*pp A np A 


7JU 


NM_002843


PTPRJ 


TIY^ A AP APPP A AP A APY^ A A 




NM_002843


rlrKJ 


ATT ATVTTTn APTA A ATP;iY» 


019 


NM_002843


rlrKJ 


TPAPTPA APAPTPA An APT 


OH 


NMjOuZ843 


rlrKJ 


A A/'** r piv^/iPTry a p ap^^p a 




NM_002843


PTPRJ


PPPP A P A PP APnPTGTTPP 
VJUCCAUACCACVIVJ 1 VJ I. IV^V^ 


01^ 

7JJ 


NM_002843


PTPRJ


Tfir*Ar»Tir:r3 A APPmppprfTPA 

1 CAC 1 VJVjAACC 1 VJVJCC VIVJA 


7J0 


NM_002843


PTPRJ 


A P A P A PP A PPP A PPTY^PP A 
ACACAVjVjAVjVjVJAVA-. 1 VJfVJV^/a 


937 


Kill J #W)0>l'l 


PTPRJ 


TfyrTTTC ATTTG ATP AP3GG 
IVJI l^lV'Al 1 I VIA 1 WAVJVIVJ 


938 


NM_004037




TP ATPPGGG AG A AGT A P AT 
1 \^A 1 ^v>ViViVJAV7AAVJ 1 rW> A I 


939 


. NM JU04U37 




APPP A APTATAPP A APvTi A A 
AV^V^V^ AAV^ I A 1 Av.V^AAVJVfA/\ 


940 


XTkjf AA/I/YM 


/\ivirux 


PPTGP ATG A APP AG A A.GPA 

V»'V> 1 VJv»A 1 VJAAVA^ AvJ/v^VJV»^ 


941 




AMPfY> 


PrnCGGOAGGTCTl' lU AGA 


942 




AMpn? 


GPPTTTTTG ATGTGT Af^PO 


943 


KTiv/f nrwnn 


AMpn? 

rUVirU^ 


G AP A AP ATG AG A A ATfrrrfi 

VJ Av<'AAV<-A 1 VJAVJ/VV\ I V_-VJ I VJ 


OAA 
7*f*r 


INM^UWUj / 


AMPn7 
rvlviri/^ 


GPP ArPP AGTG A A AGP* A A A 
VJV^V'AVvVrfV^ AVJ I VJAAAVJV^ AAA 


04^ 




AKHDnO 
AIVlrL/^ 


P APPA AP APPTTTP ATV^PP 
CAvJvJAACAC 1 1 ICCAl V-VjC 


7*tO 


NMJLRWUj/ 




TPTPnP A P A PPP A PPTY7PP 
1 VJ I uOUAU AUUC AvjC X uCC 




NM_004037




VJVjCu IvjAACAvjACvjC I OCvJ 


OAS 


KTX>I fV\A nil 

NM_004037


AMrU^ 


A A ATATlPrY~TTTA A/^A APT' 
AaAIAICCCI 1 lAAvj/V./\uC 


7*l'7 


NMJUU41/37 




AJTA A AGAGPP AfTYSGPTirSI*. 


zfJU 






PGfTPPTGP ATY? A APPA fSA A 
V»VJ I V.W 1 VJV-A I VJAAV*V./WyAA 




INIVIJUIfIUj / 




GmnACCAACAACAGC^nr* 

VJV^ 1 V#AVJV»AA>Vi.AAV AVJVvV* ■ V. 


952 






PAPATrATPAAGGAGrSTGA 

V-AV^A > V'A I V» AAVJVIAVJVJI 1 VJA 


953 


NM rtft4m7 






954 


MM nOdjCWI 




AAGPTrAGCTCCrnCGATA 


955 




AMPD2 


TGPG AT ATGTGTTI A GPTGn 


956 


NM_004037 


AMPD2 


CTGGGCCCATCCACCACCr 


957 


NM_004037


AMPD2 


GAAGGACCAGCTAGCCTGG 


958 


NM_016218 


POLK 


TATnCATTTCTTGTCAkAT 


959 


NM_016218 


POLK 


GACGAGGGATGGAGAGAGG 


960 


NM_016218 


POLK 


AGTAGATTGTATAGCrTTA 


961 


NM_016218


POLK 


TATAGATAACTCATCTAAA 


962 


NM_016218 


POLK


AAOAACnTGCAOrTGAGCT 


963 


NM_016218


POLK 


GAATTAOAACAAAGCCGAA 


964 
NM_016218


POLK


TOTGCTATCAATGAGrrCT 


965 




POLK 


ACACCTGACGAGGGATGOA 


966 


NM_016218


POLK 


TGCATCTACACriTCATCT 


967 


NM_016218


POLK 


ACACAGCTGACGAGOGATG 


968 


NM_016218


POLK 


TOGATAGCACAAAGGAGAA 


969


NM_016218


POLK 


AGGGTOCATCAGTCTGGAA 


970 


NM_016218 


POLK 


TATAGCnTACTACATACT 


971 


NM_016218


POLK 


TCnrrCTACTGCAGAAGAA 


972 


NM_016218 


POLK 


GTrCTTrcTACTGCAGAAG 


973 


NM_016218


POLK 


CTGACAAAGATAACrrrrGT 


974 


NM_016218


POLK 


GCATCAGTCTGCAAGCCTT 


975 


NM_016218


POLK 


CTCZAGGATCTACAGAAAGA 


976 


NM_016218


POLK 


AAGGAGATTTGGTGTrCGT 


977 


NM_016218


POLK 


TAGTGCACATTGACATGGA 


978 

All references cited herein are incorporated herein by reference in their entirety and
for all purposes to the same extent as if each individual publication or patent or patent 
application was specifically and individually indicated to be incorporated by reference in its 
entirety for all purposes. 

Many modifications and variations of the present invention can be made without 
departing from its spirit and scope, as will be apparent to those skilled in the art. The specific
embodiments described herein are offered by way of example only, and the invention is to be 
limited only by the terms of the appended claims along with the full scope of equivalents to 
which such claims are entitled. 